Writing a Static Analyser for PHP in Rust

Now that we have a basic Rule API set up, we can start implementing some analysis rules. We're going to begin with a simple ValidFunctionRule that checks every function call to ensure the function actually exists.

The rule itself is relatively simple. We need to create a new struct and implement the Rule trait on it.

#[derive(Debug)]
pub struct ValidFunctionRule;

impl Rule for ValidFunctionRule {
    fn should_run(&self, node: &dyn Node) -> bool {
        downcast::<FunctionCallExpression>(node).is_some()
    }

    fn run(&mut self, node: &mut dyn Node, definitions: &DefinitionCollection) {
        todo!()
    }
}

The rule only looks at FunctionCallExpression nodes, so we can filter for those inside of should_run().

The run() method can then execute and do the existence check on the FunctionCallExpression.target.

fn run(&mut self, node: &mut dyn Node, definitions: &DefinitionCollection) {
    let function_call_expression = downcast::<FunctionCallExpression>(node).unwrap();

    match function_call_expression.target.as_ref() {
        Expression::Identifier(Identifier::SimpleIdentifier(SimpleIdentifier { value: function_name, .. })) => {
            if definitions.get_function(function_name).is_none() {
                todo!()
            }
        },
        _ => return,
    }
}

Unwrapping the result from downcast() is safe since should_run() has already filtered out every other node type. To get rid of the todo!() in the method, we need to update the Rule trait so that run() also accepts the MessageCollector that we created in the previous part.

pub trait Rule: Debug {
    fn should_run(&self, node: &dyn Node) -> bool;
    fn run(&mut self, node: &mut dyn Node, definitions: &DefinitionCollection, messages: &mut MessageCollector);
}

Now the rule can add a message to the collector.

fn run(&mut self, node: &mut dyn Node, definitions: &DefinitionCollection, messages: &mut MessageCollector) {
    let function_call_expression = downcast::<FunctionCallExpression>(node).unwrap();

    match function_call_expression.target.as_ref() {
        Expression::Identifier(Identifier::SimpleIdentifier(SimpleIdentifier { value: function_name, .. })) => {
            if definitions.get_function(function_name).is_none() {
                messages.add(format!("Function `{}` not found", function_name));
            }
        },
        _ => return,
    }
}
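To see the should_run() / run() pattern in isolation, here is a minimal, self-contained sketch. The Node trait, the FunctionCall and Echo nodes, the downcast() helper and the check() function are simplified stand-ins invented for this example, not the real parser or analyser types, and a HashSet of names plays the role of the DefinitionCollection.

```rust
use std::any::Any;
use std::collections::HashSet;
use std::fmt::Debug;

// Hypothetical stand-in for the parser's Node trait.
pub trait Node: Any + Debug {
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical node types: a function call and an unrelated statement.
#[derive(Debug)]
pub struct FunctionCall {
    pub name: String,
}

#[derive(Debug)]
pub struct Echo;

impl Node for FunctionCall {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Node for Echo {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Try to view a trait object as a concrete node type.
pub fn downcast<T: Node>(node: &dyn Node) -> Option<&T> {
    node.as_any().downcast_ref::<T>()
}

pub trait Rule: Debug {
    fn should_run(&self, node: &dyn Node) -> bool;
    fn run(&mut self, node: &dyn Node, known: &HashSet<String>, messages: &mut Vec<String>);
}

#[derive(Debug)]
pub struct ValidFunctionRule;

impl Rule for ValidFunctionRule {
    fn should_run(&self, node: &dyn Node) -> bool {
        downcast::<FunctionCall>(node).is_some()
    }

    fn run(&mut self, node: &dyn Node, known: &HashSet<String>, messages: &mut Vec<String>) {
        // should_run() guards this call, but returning early avoids a panic
        // if run() is ever invoked directly with another node type.
        let Some(call) = downcast::<FunctionCall>(node) else {
            return;
        };

        if !known.contains(&call.name) {
            messages.push(format!("Function `{}` not found", call.name));
        }
    }
}

// Run every applicable rule against every node and collect the messages.
pub fn check(nodes: &[Box<dyn Node>], known: &HashSet<String>) -> Vec<String> {
    let mut rules: Vec<Box<dyn Rule>> = vec![Box::new(ValidFunctionRule)];
    let mut messages = Vec::new();

    for node in nodes {
        for rule in rules.iter_mut() {
            if rule.should_run(node.as_ref()) {
                rule.run(node.as_ref(), known, &mut messages);
            }
        }
    }

    messages
}
```

Calling check() with a FunctionCall for an unknown name yields a single "not found" message, while the Echo node is skipped because should_run() returns false for it.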

The rule itself can be registered with the analyser.

fn run(args: AnalyseCommand) {
    // ...

    let mut analyser = Analyser::new(collection);
    analyser.add_rule(Box::new(rules::functions::valid_function::ValidFunctionRule));

    // ...
}

A simple PHP file with the code below will produce the following messages:

<?php

foo();

cargo run -- analyse ./playground/invalid-function.php

[src/cmd/analyse.rs:29] messages = MessageCollector {
    file: "./playground/invalid-function.php",
    messages: [
        "Function `foo` not found",
    ],
}

Since we've got some real messages, we might as well spend a couple of minutes improving the format of the output in the terminal. Let's go with tables for now and pull in the prettytable-rs crate to do the heavy lifting.

cargo add prettytable-rs

impl MessageCollector {
    // ...

    pub fn iter(&self) -> std::slice::Iter<'_, String> {
        self.messages.iter()
    }

    pub fn get_file(&self) -> &str {
        self.file.as_str()
    }

    // ...
}

fn run(args: AnalyseCommand) {
    // ...

    let messages = analyser.analyse(args.file, &contents);

    let mut table = Table::new();
    table.add_row(row![messages.get_file()]);
    for message in messages.iter() {
        table.add_row(row![message]);
    }
    table.printstd();
}

And now the output looks like this:

+-----------------------------------+
| ./playground/invalid-function.php |
+-----------------------------------+
| Function `foo` not found          |
+-----------------------------------+

Much nicer than Rust's default dbg!() output.

Resolving Names

The rule that we've just written is producing an error, but it's not entirely correct. If the PHP code is updated to the following:

<?php

function foo() {
    
}

foo();

The analyser is still going to produce an error. The definition collector is saving a FunctionDefinition for foo(), but it stores it under the fully-qualified name \foo, while the rule looks up the unqualified name foo.

We need to take the same resolve_name() logic from the DefinitionCollector and add it to the Analyser somewhere. We can't simply add it to the Analyser because that will be used for multiple files, so instead we need to create some sort of Scope or Context structure that we can use for a single file.

I like the name Context, so let's go with that.

#[derive(Debug, Clone)]
pub struct Context {
    namespace: ByteString,
    imports: Vec<ByteString>,
}

impl Context {
    pub fn new() -> Self {
        Self {
            namespace: ByteString::default(),
            imports: Vec::new(),
        }
    }

    pub fn resolve_name(&self, name: &ByteString) -> ByteString {
        // If the name is already fully qualified, return as is.
        if name.bytes.starts_with(b"\\") {
            return name.clone();
        }

        let parts = name.split(|b| *b == b'\\').collect::<Vec<&[u8]>>();
        let first_part = parts.first().unwrap();

        // Check each imported name to see if it ends with the first part of the
        // given identifier. If it does, we can assume you're referencing either
        // an imported namespace or class that has been imported.
        for imported_name in self.imports.iter() {
            if imported_name.ends_with(first_part) {
                let mut qualified_name = imported_name.clone();
                qualified_name.extend(&name.bytes[first_part.len()..]);

                return qualified_name;
            }
        }

        // If we've reached this point, we have a simple name that
        // is not fully qualified and we have not imported it.
        // We can simply prepend the current namespace to it.
        let mut qualified_name = self.namespace.clone();
        qualified_name.extend(b"\\");
        qualified_name.extend(&name.bytes);

        qualified_name
    }

    pub fn set_namespace(&mut self, namespace: ByteString) {
        self.namespace = namespace;
    }

    pub fn add_import(&mut self, import: ByteString) {
        self.imports.push(import);
    }
}
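To make the three branches of resolve_name() concrete, here is a small worked sketch. It is an assumption-laden stand-in: it uses String instead of ByteString, and it compares the first part of the name against the whole last segment of each import rather than calling ends_with(), so that an import such as \App\barfoo is not picked up when resolving foo.

```rust
// Simplified stand-in for Context, using String instead of ByteString.
#[derive(Debug, Default, Clone)]
pub struct Context {
    namespace: String,
    imports: Vec<String>,
}

impl Context {
    pub fn set_namespace(&mut self, namespace: &str) {
        self.namespace = namespace.to_string();
    }

    pub fn add_import(&mut self, import: &str) {
        self.imports.push(import.to_string());
    }

    pub fn resolve_name(&self, name: &str) -> String {
        // 1. Already fully qualified: return as is.
        if name.starts_with('\\') {
            return name.to_string();
        }

        // `split` always yields at least one item, so the fallback is never used.
        let first_part = name.split('\\').next().unwrap_or(name);

        // 2. The first part matches the last segment of an import:
        //    swap it for the imported name and keep the rest.
        for imported in &self.imports {
            if imported.rsplit('\\').next() == Some(first_part) {
                return format!("{}{}", imported, &name[first_part.len()..]);
            }
        }

        // 3. Otherwise, prepend the current namespace.
        format!("{}\\{}", self.namespace, name)
    }
}
```

With the namespace set to \App and \Vendor\Http imported, Http\Client resolves to \Vendor\Http\Client and helper resolves to \App\helper. With a default context (empty namespace, no imports), foo resolves to \foo.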

Since a single file could have multiple contexts, we'll use a context_stack on the Analyser to store them.

#[derive(Debug)]
pub struct Analyser {
    rules: Vec<Box<dyn Rule>>,
    definitions: DefinitionCollection,
    message_collector: MessageCollector,
    context_stack: Vec<Context>,
}

impl Analyser {
    pub fn new(definitions: DefinitionCollection) -> Self {
        Self {
            rules: Vec::new(),
            definitions,
            message_collector: MessageCollector::default(),
            context_stack: Vec::new(),
        }
    }

    pub fn analyse(&mut self, file: String, contents: &[u8]) -> MessageCollector {
        self.message_collector = MessageCollector::new(file);

        let parse_result = parse(contents);
        if let Err(error) = parse_result {
            self.message_collector.add(error.to_string());
            return self.message_collector.clone();
        }

        let mut ast = parse_result.unwrap();

        self.context_stack.push(Context::new());
        self.visit_node(&mut ast).unwrap();

        self.message_collector.clone()
    }

    pub fn add_rule(&mut self, rule: Box<dyn Rule>) {
        self.rules.push(rule);
    }
}

impl Visitor<()> for Analyser {
    fn visit(&mut self, node: &mut dyn Node) -> Result<(), ()> {
        let mut context = self.context_stack.last_mut().unwrap();

        for rule in &mut self.rules {
            if rule.should_run(node) {
                rule.run(node, &self.definitions, &mut self.message_collector, &mut context);
            }
        }

        Ok(())
    }
}

Once the Rule trait accepts the Context and ValidFunctionRule resolves the function name through it before the lookup, running the rule on the code from earlier no longer produces any messages: the call to foo() resolves to \foo, which is the fully-qualified name of the function itself.
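The visitor above passes the context to rule.run() as a fourth argument, so the Rule trait and ValidFunctionRule have to accept it too. A minimal sketch of that change might look like the following. Everything here is a simplified assumption: the rule receives the function name directly instead of a node, a HashSet stands in for the DefinitionCollection, a Vec<String> stands in for the MessageCollector, and the Context only knows about its namespace.

```rust
use std::collections::HashSet;

// Hypothetical, trimmed-down context: only the namespace matters here.
pub struct Context {
    pub namespace: String,
}

impl Context {
    pub fn resolve_name(&self, name: &str) -> String {
        if name.starts_with('\\') {
            name.to_string()
        } else {
            format!("{}\\{}", self.namespace, name)
        }
    }
}

// Assumed shape of the updated trait: run() gains a context parameter.
pub trait Rule {
    fn run(
        &mut self,
        function_name: &str,
        definitions: &HashSet<String>,
        messages: &mut Vec<String>,
        context: &mut Context,
    );
}

pub struct ValidFunctionRule;

impl Rule for ValidFunctionRule {
    fn run(
        &mut self,
        function_name: &str,
        definitions: &HashSet<String>,
        messages: &mut Vec<String>,
        context: &mut Context,
    ) {
        // Look the function up by its fully-qualified name, but report
        // the name as it was written in the source.
        let qualified_name = context.resolve_name(function_name);

        if !definitions.contains(&qualified_name) {
            messages.push(format!("Function `{}` not found", function_name));
        }
    }
}
```

In this sketch the lookup uses the resolved name while the message keeps the name as written. Whether to report the original or the resolved name is a design choice the article doesn't settle.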

But what if we now create two separate PHP files - one with a namespaced function and the other importing that namespaced function and calling it? Well, the analyser will fail again.

<?php

namespace App;

function foo() {

}

<?php

use function App\foo;

foo();

The first problem is that the current Context isn't storing the namespace or any imported names. We can add checks for namespace and use statements in the Analyser, and whenever we come across one, record it on the Context.

impl Visitor<()> for Analyser {
    fn visit(&mut self, node: &mut dyn Node) -> Result<(), ()> {
        let mut context = self.context_stack.last_mut().unwrap();

        if let Some(BracedNamespace { name: Some(SimpleIdentifier { value, .. }), .. }) = downcast::<BracedNamespace>(node) {
            let mut namespace = ByteString::from(b"\\");
            namespace.extend(&value.bytes);
            context.set_namespace(namespace);
        }

        if let Some(UnbracedNamespace { name: SimpleIdentifier { value, .. }, .. }) = downcast::<UnbracedNamespace>(node) {
            let mut namespace = ByteString::from(b"\\");
            namespace.extend(&value.bytes);
            context.set_namespace(namespace);
        }

        if let Some(GroupUseStatement { prefix, uses, .. }) = downcast::<GroupUseStatement>(node) {
            for Use { name, .. } in uses {
                let mut prefixed_name = prefix.value.clone();
                prefixed_name.extend(b"\\");
                prefixed_name.extend(&name.value.bytes);

                context.add_import(prefixed_name);
            }
        }

        if let Some(UseStatement { uses, .. }) = downcast::<UseStatement>(node) {
            for Use { name, .. } in uses {
                let mut qualified_name = ByteString::from(b"\\");
                qualified_name.extend(&name.value.bytes);
                context.add_import(qualified_name);
            }
        }

        for rule in &mut self.rules {
            if rule.should_run(node) {
                rule.run(node, &self.definitions, &mut self.message_collector, &mut context);
            }
        }

        Ok(())
    }
}
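The name handling for use statements can also be looked at on its own. The sketch below uses two hypothetical helpers, qualify_import() and expand_group_use(), working on plain strings. It assumes that every import should be stored in fully-qualified form with a leading backslash, including the ones that come from a group use, so that resolve_name() always returns names in the same shape as the definitions.

```rust
// Build a fully-qualified import from a `use` name,
// e.g. `App\foo` becomes `\App\foo`.
pub fn qualify_import(name: &str) -> String {
    format!("\\{}", name.trim_start_matches('\\'))
}

// Expand a group use such as `use App\{foo, bar};`
// into one fully-qualified import per name.
pub fn expand_group_use(prefix: &str, names: &[&str]) -> Vec<String> {
    let prefix = prefix.trim_end_matches('\\');

    names
        .iter()
        .map(|name| qualify_import(&format!("{}\\{}", prefix, name)))
        .collect()
}
```

For example, expanding the prefix App with the names foo and bar gives \App\foo and \App\bar, which is the form the definitions are stored in.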

There's some code duplication from the DefinitionCollector here but we can come back to tidy this up later on. This does fix the issue with importing functions from other namespaces, which is perfect!

Native Functions

If we try to analyse the following PHP code:

abs(-1);

The analyser will currently tell us that the function abs() doesn't exist. But what's the problem? That function is native to PHP, so it should definitely exist! Well, the issue is that our DefinitionCollector can't detect native PHP functions because they're not defined anywhere in the PHP code of our project.

Thankfully this is something that other tools have also encountered in the past, which means there are stub PHP files available with all of PHP's native functions, classes, interfaces, etc.

PHPStan has a public repository with a collection of stubs automatically taken from PHP's codebase. There are a couple of ways we can get the analyser to scan for these files:

1. Embed the files inside of the binary at compile-time.
2. Require the stubs package as part of our project and let the DefinitionCollector handle them like regular PHP files.

I'm going to opt for the second approach to keep things simple. Embedding the files is slightly more complicated and requires a third-party crate to embed directories, since Rust only supports embedding single files out of the box.

A quick composer command will bring those stubs into our project.

composer require phpstan/php-8-stubs --dev

Trying to analyse that same PHP code now will expose another problem. The definition collector doesn't have support for all of PHP's types just yet. Now is a good time to start adding some more types in, so let's do that.

We first want to identify what types are missing, so instead of just doing a todo!() we can add some additional information to the output.

fn map_type(&self, data_type: Option<&ParsedType>) -> Option<Type> {
    data_type.map(|t| match t {
        ParsedType::Named(_, name) => Type::Named(self.resolve_name(name)),
        ParsedType::Float(_) => Type::Float,
        ParsedType::Boolean(_) => Type::Bool,
        ParsedType::Integer(_) => Type::Int,
        ParsedType::String(_) => Type::String,
        ParsedType::Array(_) => Type::Array,
        ParsedType::Mixed(_) => Type::Mixed,
        _ => todo!("unhandled type: {:?}", t),
    })
}

And after running the analyse command and adding any missing types, we end up with this:

#[derive(Debug, Clone)]
pub enum Type {
    String,
    Int,
    Float,
    Array,
    Mixed,
    Bool,
    Object,
    Void,
    False,
    True,
    Null,
    Callable,
    Static,
    Iterable,
    Nullable(Box<Self>),
    Named(ByteString),
    Union(Vec<Self>),
}

fn map_type(&self, data_type: Option<&ParsedType>) -> Option<Type> {
    data_type.map(|t| match t {
        ParsedType::Named(_, name) => Type::Named(self.resolve_name(name)),
        ParsedType::Float(_) => Type::Float,
        ParsedType::Boolean(_) => Type::Bool,
        ParsedType::Integer(_) => Type::Int,
        ParsedType::String(_) => Type::String,
        ParsedType::Array(_) => Type::Array,
        ParsedType::Mixed(_) => Type::Mixed,
        ParsedType::Void(_) => Type::Void,
        ParsedType::Object(_) => Type::Object,
        ParsedType::Nullable(_, data_type) => Type::Nullable(Box::new(self.map_type(Some(data_type)).unwrap())),
        ParsedType::Union(data_types) => {
            let mut types = Vec::new();

            for data_type in data_types {
                types.push(self.map_type(Some(data_type)).unwrap());
            }

            Type::Union(types)
        },
        ParsedType::False(_) => Type::False,
        ParsedType::True(_) => Type::True,
        ParsedType::Null(_) => Type::Null,
        ParsedType::Callable(_) => Type::Callable,
        ParsedType::StaticReference(_) => Type::Static,
        ParsedType::Iterable(_) => Type::Iterable,
        _ => todo!("unhandled type: {:?}", t),
    })
}
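The recursive cases (nullable and union types) are the interesting part of map_type(). Here is a cut-down sketch of the same idea with hypothetical, much smaller ParsedType and Type enums. Unlike the listing above, it returns None for a type it doesn't handle yet instead of hitting todo!(), which is one possible alternative rather than what the analyser does.

```rust
// Hypothetical, cut-down version of the parser's type enum.
#[derive(Debug, Clone, PartialEq)]
pub enum ParsedType {
    Int,
    String,
    Null,
    Nullable(Box<ParsedType>),
    Union(Vec<ParsedType>),
    Never,
}

// Hypothetical, cut-down version of the analyser's type enum.
#[derive(Debug, Clone, PartialEq)]
pub enum Type {
    Int,
    String,
    Null,
    Nullable(Box<Type>),
    Union(Vec<Type>),
}

// Map a parsed type to an analyser type, recursing into nested types.
// Returns None if any part of the type is not handled yet.
pub fn map_type(parsed: &ParsedType) -> Option<Type> {
    Some(match parsed {
        ParsedType::Int => Type::Int,
        ParsedType::String => Type::String,
        ParsedType::Null => Type::Null,
        ParsedType::Nullable(inner) => Type::Nullable(Box::new(map_type(inner)?)),
        ParsedType::Union(members) => {
            // Collecting into Option<Vec<_>> stops at the first unhandled member.
            Type::Union(members.iter().map(map_type).collect::<Option<Vec<_>>>()?)
        }
        ParsedType::Never => return None,
    })
}
```

A nullable int maps to Type::Nullable(Box::new(Type::Int)), and a union containing an unhandled member maps to None as a whole.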

There are still some types missing, such as never and intersection types. Those will be added when we need them.

Analysing the abs() example once more, no messages are added and we can now successfully analyse PHP's native functions.

Next steps

Now that we have some of the base-level work done for analysis, the next part will start to focus on storing variable types inside of Context. We'll start to analyse expressions and work out the type that each expression evaluates to.

All of the code for this part can be found on GitHub.
