Ryan Chandler

Writing a Static Analyser for PHP in Rust - Overview


In this new series, I'd like to go over the process of writing a static analyser for PHP in the Rust programming language. We'll start by talking about what a static analyser is, the general phases such a program goes through, and how to actually write one in Rust.

This post will cover the basics of static analysis and lay out the plan for our own engine.

What is static analysis?

Static analysis is the process of analysing source code by examining its structure and syntax, without executing the program. This type of analysis can be used to find bugs, security vulnerabilities, and other issues in your code prior to execution.

How does a static analyser work?

The majority of static analysis tools will use an abstract syntax tree (AST) that represents the syntactic structure of your code. Let's take the following PHP code and produce a pseudo abstract syntax tree.

function name(): string {
    return "Ryan";
}

$name = name();

A parser will take this string of code and produce something similar to the structure below:

FunctionStatement {
    name: "name",
    parameters: [],
    return_type: "string",
    body: [
        ReturnStatement {
            value: String("Ryan"),
        }
    ],
},
ExpressionStatement {
    expression: Assign {
        target: Variable("$name"),
        value: CallExpression {
            target: Identifier("name"),
            arguments: [],
        }
    }
}

By obtaining this reusable, well-defined representation of the code, a program can reliably inspect specific properties and fields during analysis.
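To make this more concrete, here is a minimal sketch of how the pseudo-AST above might be modelled in Rust. The type and variant names (`Statement`, `Expression`, and so on) are illustrative assumptions, not taken from any particular parser.

```rust
// A sketch of an AST for the two statements above. Real parsers track
// much more (spans, attributes, modifiers), but the shape is similar.

#[derive(Debug, PartialEq)]
enum Statement {
    Function {
        name: String,
        parameters: Vec<String>,
        return_type: Option<String>,
        body: Vec<Statement>,
    },
    Return(Expression),
    Expression(Expression),
}

#[derive(Debug, PartialEq)]
enum Expression {
    String(String),
    Variable(String),
    Identifier(String),
    Assign {
        target: Box<Expression>,
        value: Box<Expression>,
    },
    Call {
        target: Box<Expression>,
        arguments: Vec<Expression>,
    },
}

fn main() {
    // `function name(): string { return "Ryan"; }` as data.
    let function = Statement::Function {
        name: "name".to_string(),
        parameters: vec![],
        return_type: Some("string".to_string()),
        body: vec![Statement::Return(Expression::String("Ryan".to_string()))],
    };

    println!("{:?}", function);
}
```

Rust enums fit this problem well: each node kind is a variant, and `Box` breaks the recursion so expressions can contain expressions.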

Finding definitions

A "definition" is any form of structure in your code that can be referenced, instantiated or invoked. In PHP's case, that would normally be a class or function. PHP also has traits, interfaces and enumerations, which would be discovered too.

The analyser will find every single PHP file in your project, including any vendor files, then send each one through some form of definition collector. The job of the collector is to parse the code, analyse the AST and find the structures mentioned above.

Once a definition has been found, it will be further analysed to pull out any information the analyser needs. Let's take a look at an example from a fake Laravel codebase.

namespace App\Models;

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\BelongsTo;
use App\Models\User;

class Post extends Model 
{
    /** @return BelongsTo<User> */
    public function user(): BelongsTo
    {
        return $this->belongsTo(User::class);
    }
}

The collector will parse this file and start to recursively look at each statement.

It first encounters the namespace, taking the name and storing it for later. Then the use statements are seen and each of those references is temporarily stored so that fully-qualified names (FQNs) can be resolved when necessary.

Once it reaches the class statement, the collector starts to do some real work. The first thing it does is grab the name of the class and turn it into a fully-qualified name. That's done by prefixing the class name with the current namespace, i.e. App\Models\Post.
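In Rust, that qualification step is a one-liner. The `qualify` helper below is a hypothetical name for illustration:

```rust
// Prefix a class name with the current namespace to build its
// fully-qualified name. An empty namespace means the global namespace.
fn qualify(namespace: &str, name: &str) -> String {
    if namespace.is_empty() {
        name.to_string()
    } else {
        format!("{}\\{}", namespace, name)
    }
}

fn main() {
    println!("{}", qualify("App\\Models", "Post")); // App\Models\Post
}
```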

Since the class inherits from Model, it also needs to resolve the fully-qualified name for that too. The first check is for any imported classes that have the name Model, which works in this case.

Whilst the collector is here, it can also start to build up some information about the methods, properties and constants on the class. That's all pretty self-explanatory and follows the same general pattern as above. The result of that work is a near-flat structure of all definitions across the codebase.

[
    "App\\Models\\Post" => Class {
        name: "App\\Models\\Post",
        extends: "Illuminate\\Database\\Eloquent\\Model",
        methods: [
            "user" => Method {
                name: "user",
                visibility: Public,
                parameters: [],
                return_type: "Illuminate\\Database\\Eloquent\\Relations\\BelongsTo",
            }
        ]
    }
]

Note: the collector doesn't do any analysis of the code. It won't check for valid parent classes, return types, etc. It's purely a collection phase to find out what is defined in a project's codebase.
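Here is a sketch of what that collection phase could look like in Rust: walk each statement, track the current namespace and imports, and store definitions keyed by their fully-qualified names. All type names here are assumptions made for the example, and only classes are handled.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct ClassDefinition {
    name: String,
    extends: Option<String>,
}

// A drastically simplified statement type for the walkthrough above.
enum Statement {
    Namespace(String),
    Use(String),
    Class { name: String, extends: Option<String> },
}

#[derive(Default)]
struct Collector {
    namespace: String,
    imports: HashMap<String, String>,
    definitions: HashMap<String, ClassDefinition>,
}

impl Collector {
    fn collect(&mut self, statements: &[Statement]) {
        for statement in statements {
            match statement {
                Statement::Namespace(name) => self.namespace = name.clone(),
                Statement::Use(fqn) => {
                    // `use App\Models\User;` is keyed by its short name.
                    let short = fqn.rsplit('\\').next().unwrap().to_string();
                    self.imports.insert(short, fqn.clone());
                }
                Statement::Class { name, extends } => {
                    let fqn = format!("{}\\{}", self.namespace, name);
                    // Resolve the parent through the imports first, falling
                    // back to qualifying it with the current namespace.
                    let extends = extends.as_ref().map(|parent| {
                        self.imports
                            .get(parent)
                            .cloned()
                            .unwrap_or_else(|| format!("{}\\{}", self.namespace, parent))
                    });
                    self.definitions
                        .insert(fqn.clone(), ClassDefinition { name: fqn, extends });
                }
            }
        }
    }
}

fn main() {
    let mut collector = Collector::default();
    collector.collect(&[
        Statement::Namespace("App\\Models".to_string()),
        Statement::Use("Illuminate\\Database\\Eloquent\\Model".to_string()),
        Statement::Class { name: "Post".to_string(), extends: Some("Model".to_string()) },
    ]);

    println!("{:?}", collector.definitions["App\\Models\\Post"]);
}
```

Running the collector over the pseudo-Laravel file resolves `Model` to `Illuminate\Database\Eloquent\Model` via the import map, exactly as described above.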

User code

As mentioned above, the collector will analyse your entire codebase, including any third-party dependencies found in the vendor folder. This is essential to ensure the analyser has all of the information it needs.

When it comes to actually analysing your own code though, the analyser doesn't need to waste time analysing the third-party stuff, so instead it will skip over vendor and any other ignored paths and focus solely on the code that you've written.
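Gathering that file list can be done with nothing but the standard library. The sketch below walks a directory tree and collects `.php` files, skipping ignored directories such as `vendor`; error handling is kept deliberately minimal.

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Recursively collect PHP files, skipping any ignored directory names.
fn php_files(root: &Path, ignored: &[&str], out: &mut Vec<PathBuf>) {
    let Ok(entries) = fs::read_dir(root) else { return };

    for entry in entries.flatten() {
        let path = entry.path();

        if path.is_dir() {
            let name = entry.file_name();
            if ignored.iter().any(|dir| name == *dir) {
                continue;
            }
            php_files(&path, ignored, out);
        } else if path.extension().is_some_and(|ext| ext == "php") {
            out.push(path);
        }
    }
}

fn main() {
    let mut files = Vec::new();
    php_files(Path::new("."), &["vendor", "node_modules"], &mut files);
    println!("found {} PHP files", files.len());
}
```

A production analyser would likely parallelise this and respect a configuration file, but the filtering logic stays the same.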

Once again, the analyser will build up a list of files in your project and start to analyse each one independently. Most engines will provide a "rule" API that is used to separate out the logic for each check in the analyser.

The rule itself will respond to a particular type of node in the AST and only execute its logic on that node. To explore the various APIs available to a rule, let's write one in pseudo-PHP that checks the arguments passed to a function are of the correct types.

function add(int $a, int $b): int {
    return $a + $b;
}

add("one", 2);

Let's pretend the AST produces a CallExpression for the function call to add(). Our rule will need to tell the analyser that it only cares about that particular type of expression.

class FunctionArgumentRule implements Rule
{
    public function shouldRun(Node $node): bool
    {
        return $node instanceof CallExpression;
    }

    public function run(Node $node, Scope $scope, Reporter $reporter): void
    {
        // Analysing calls to anonymous functions requires a little
        // more work, so we'll skip those for now.
        if (! $node->target instanceof SimpleIdentifier) {
            return;
        }

        $definition = $scope->getFunction($node->target);

        // In the case where we can't find the definition,
        // we can return early to avoid any errors below.
        // A separate rule will handle these invalid calls.
        if ($definition === null) {
            return;
        }

        $parameters = $definition->getParameters();

        if (count($node->arguments) < count($parameters)) {
            $reporter->report(
                sprintf('Function %s expects %d arguments, only got %d.', $node->target->value, count($parameters), count($node->arguments))
            );

            // Bail out early: the positional loop below would read
            // past the end of the argument list otherwise.
            return;
        }

        // As this is just an example, we'll assume that all 
        // of the arguments being passed to the function are
        // positional and ignore any named arguments.
        foreach ($parameters as $position => $parameter) {
            $parameterType = $parameter->getType();

            if ($parameterType === null || $parameterType->isMixed()) {
                continue;
            }

            $argument = $node->arguments[$position];
            $argumentType = $scope->getTypeOfExpression($argument);

            if (! $parameterType->isCompatibleWith($argumentType)) {
                $reporter->report(
                    sprintf('Argument #%d (%s) must be of type %s, got %s.', $position + 1, $parameter->getName(), $parameterType->stringify(), $argumentType->stringify())
                );
            }
        }
    }
}

With this pseudo-analyser API, you can begin to see how the analyser works under the hood.

The rule will receive some sort of "scope" or "environment" object that stores information about where the node is located. It holds information about the variables in scope, the types of those variables, and can be used to find definitions.

Searching for definitions is also important since that will be at the core of most typechecking rules. The scope will have a reference to the definitions collected earlier on in the process and be able to do arbitrary lookups when necessary.

It will first check to see if a function exists in the current namespace, e.g. in the namespace App, a call to function foo() will first check for a function with the fully-qualified name App\foo.

If that function doesn't exist, it will then check for any imported functions with the name foo, e.g. use function Package\foo. Finally, it will look for a function defined in the global namespace called foo - this is normally where a lookup will land when analysing native PHP functions.
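That three-step lookup translates naturally into Rust. The `Scope` name mirrors the pseudo-code above; its fields and the `resolve_function` method are assumptions for the sake of the sketch.

```rust
use std::collections::{HashMap, HashSet};

struct Scope {
    namespace: String,
    imported_functions: HashMap<String, String>,
    definitions: HashSet<String>,
}

impl Scope {
    fn resolve_function(&self, name: &str) -> Option<String> {
        // 1. A function in the current namespace, e.g. App\foo.
        let namespaced = format!("{}\\{}", self.namespace, name);
        if self.definitions.contains(&namespaced) {
            return Some(namespaced);
        }

        // 2. An imported function, e.g. `use function Package\foo;`.
        if let Some(fqn) = self.imported_functions.get(name) {
            return Some(fqn.clone());
        }

        // 3. A function in the global namespace — where lookups for
        //    native PHP functions like strlen() will usually land.
        if self.definitions.contains(name) {
            return Some(name.to_string());
        }

        None
    }
}

fn main() {
    let scope = Scope {
        namespace: "App".to_string(),
        imported_functions: HashMap::new(),
        definitions: HashSet::from(["strlen".to_string()]),
    };

    // No App\strlen and no import, so this falls through to the
    // global namespace.
    println!("{:?}", scope.resolve_function("strlen"));
}
```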

Most analysers will have a standardised API for representing types as well. There could be dedicated objects for PHP's scalar types (string, int, etc) and then more complex objects for generic types like Collection<T> or array<K, V>. It's important that these types are able to validate their own compatibility with another type object and be able to handle inheritance, etc.
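As a sketch of that type API, here is a `Type` enum that can check its own compatibility. A real analyser would also cover unions, generics and class inheritance; this only covers scalars and `mixed` to show the shape of the API, and the method names are assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Type {
    String,
    Int,
    Float,
    Bool,
    Mixed,
}

impl Type {
    // Can a value of type `other` be used where `self` is expected?
    fn is_compatible_with(&self, other: &Type) -> bool {
        match (self, other) {
            // Everything satisfies `mixed`.
            (Type::Mixed, _) => true,
            // PHP accepts an int where a float is expected.
            (Type::Float, Type::Int) => true,
            (a, b) => a == b,
        }
    }

    fn stringify(&self) -> &'static str {
        match self {
            Type::String => "string",
            Type::Int => "int",
            Type::Float => "float",
            Type::Bool => "bool",
            Type::Mixed => "mixed",
        }
    }
}

fn main() {
    // `add("one", 2)` from earlier: a string where an int is expected.
    let parameter = Type::Int;
    let argument = Type::String;
    println!("compatible: {}", parameter.is_compatible_with(&argument));
}
```

Keeping the compatibility logic on the type objects themselves means every rule asks the same question the same way, rather than re-implementing the checks.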

Another detail in the pseudo-implementation above is the Reporter. This object is purely used for reporting issues / errors back to the user and will generally be scoped to the current file being analysed.
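Pulling these pieces together, the rule and reporter APIs might translate to Rust traits like the following. The trait, struct names and the toy `NoArgumentsRule` are all assumptions made for illustration.

```rust
// A drastically reduced node type for the example.
enum Node {
    Call { target: String, arguments: Vec<String> },
    Other,
}

// Collects reported issues; in a real analyser this would carry
// file paths, line numbers and severities.
#[derive(Default)]
struct Reporter {
    messages: Vec<String>,
}

impl Reporter {
    fn report(&mut self, message: String) {
        self.messages.push(message);
    }
}

trait Rule {
    fn should_run(&self, node: &Node) -> bool;
    fn run(&self, node: &Node, reporter: &mut Reporter);
}

// A toy rule: flag any call that passes arguments.
struct NoArgumentsRule;

impl Rule for NoArgumentsRule {
    fn should_run(&self, node: &Node) -> bool {
        matches!(node, Node::Call { .. })
    }

    fn run(&self, node: &Node, reporter: &mut Reporter) {
        if let Node::Call { target, arguments } = node {
            if !arguments.is_empty() {
                reporter.report(format!("{} expects no arguments.", target));
            }
        }
    }
}

fn main() {
    let mut reporter = Reporter::default();
    let rules: Vec<Box<dyn Rule>> = vec![Box::new(NoArgumentsRule)];
    let node = Node::Call {
        target: "name".to_string(),
        arguments: vec!["\"Ryan\"".to_string()],
    };

    for rule in &rules {
        if rule.should_run(&node) {
            rule.run(&node, &mut reporter);
        }
    }

    for message in &reporter.messages {
        println!("{}", message);
    }
}
```

The analyser's main loop then becomes simple: for each node in each file, run every rule whose `should_run` matches.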

Why write another static analyser?

With a lot of existing art in this area (PHPStan, Psalm, Phan, etc), you might be wondering to yourself "Why write another static analyser?"... that's an excellent question! There's been a huge boom in the JavaScript ecosystem over the last few years with lots of new tools being written in lower-level languages such as Rust, Go and Zig. The same boom is yet to happen in the PHP ecosystem and I'd like to be one of the first catalysts in that movement.

By choosing a language such as Rust for a tool, you can bring memory safety and excellent performance to the table. PHP has improved a lot in the performance department over the years, but it will never reach the speeds of a systems programming language.

Alongside the speed and memory safety, I'm also developing a superset of PHP called PXP. Part of this project involved writing a PHP parser in Rust and an extension of that parser to support the new superset. Given I already have a huge chunk of the work done, it makes sense to continue on this path and write more tools in Rust, not only as a learning experience for me but for all readers too.

The static analyser itself will play a big part in PXP's transpiler and allow us to do all sorts of crazy things.

I'm not a Rust expert, but I've been using it for long enough now that I'm able to take on the challenge of writing these larger, complex systems.

The plan for this analysis engine

The rest of this series will cover the various phases in writing a static analyser. Before I get too far into the project itself, it's a good idea to come up with a plan of action and figure out the milestones and goals.

The initial goal for the analysis engine is to replicate PHPStan's lowest level of analysis (configuration on GitHub). This includes the following:

  • Validating function calls (function exists, number of arguments, argument types)
  • Validating class instantiation (class exists, number of constructor arguments, argument types)
  • Validating inheritance (ensuring parent class exists, abstract methods implemented)
  • Validating interface implementations (ensuring interface exists, contract methods implemented)
  • Validating enums (missing backed type, validating case values, no abstract methods, no __toString())
  • Validating method calls (static and instance methods, number of arguments, argument types)
  • Validating mathematical operations, e.g. flagging string + int between incompatible types.

All development will be open-source and available on GitHub. A link to the repository will be provided in the next post.

Epilogue

I'm excited to start working on this project and see where it goes. I hope that you'll learn a thing or two about static analysis and Rust, perhaps you'll even contribute to the project at some point.

Posts will likely come out once a week to allow for development time in between, as well as any feedback from readers and those interested.