Writing a Static Analyser for PHP in Rust - Definitions

In this part of the series, we'll start working on the definition collection process for the analyser. We'll learn about the parser, visitors and more.

Let's start by installing the parser. We'll be using the PXP parser, since PXP is a superset of PHP: it supports all of PHP as well as PXP's extended language features. It's also less work to use this parser now than to start with a regular PHP parser and switch over to the PXP one in the future.

cargo add pxp-parser --git https://github.com/pxp-lang/parser --branch main

The PXP parser hasn't had an official release at this point in time, so we'll pull it straight from GitHub.

This early in development, we can make some basic assumptions to make things easier. In this instance, we'll assume that the command is being executed from the root of a project, such that ./vendor is a valid path.

With this assumption in mind, to start discovering files, we need to recursively search the current directory for PHP files. Thankfully I've already developed a crate to do just this, so it's time to get it installed.

cargo add discoverer

It's also time to make a new module in the project to store any code related to definitions and their discovery. We'll start with src/definitions/mod.rs and put the collector itself in src/definitions/collector.rs.

To get a list of file paths, we just need to call the discoverer::discover() function. Let's do that inside of the analyse::run() function.

let files = discoverer::discover(&["php"], &["."]).unwrap();

We're looking for php files in the current directory (.). There's a lack of error handling here but that's okay for now.

Before we start writing the DefinitionCollector logic, we need to actually parse the files we've discovered.

let files = discoverer::discover(&["php"], &["."]).unwrap();

for file in files {
    let contents = std::fs::read(&file).unwrap();
    let ast = pxp_parser::parse(&contents).unwrap();
}

The initial implementation of the analyser is going to be purely synchronous, so we don't need to worry about processing files in parallel or anything like that. We can write incredibly crude code and come back to optimise later.

The pxp_parser::parse() function essentially accepts anything that can be converted into &[u8]. In our case, we're reading the file into a Vec<u8> which can be converted into a slice. The parser will do its thing and return a Result<Vec<Statement>, ParseError>. Remember, there's no error handling just yet so we'll .unwrap() and assume everything is okay.
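The "convertible into &[u8]" bound is typically expressed with AsRef. Here's a rough sketch of that pattern; this is not the parser's actual signature, just the general shape of the bound:

```rust
// Sketch of the "anything byte-like" pattern; NOT pxp_parser's real signature.
fn parse(input: impl AsRef<[u8]>) -> usize {
    // A real parser would lex the bytes and build an AST;
    // here we just report how many bytes we were given.
    input.as_ref().len()
}

fn main() {
    // A Vec<u8>, as returned by std::fs::read()...
    let contents: Vec<u8> = b"<?php".to_vec();
    assert_eq!(parse(contents), 5);

    // ...and a plain byte slice both satisfy the bound.
    assert_eq!(parse(b"<?php echo 1;".as_slice()), 13);
}
```

This is why passing `&contents` (a `&Vec<u8>`) straight from `std::fs::read()` works without any conversion on our side.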

Now that we've got an AST for a given file, we can start to collect definitions. This will be handled by a single structure named DefinitionCollector. The AST for each file will be passed along to a method on the structure and the nodes will be analysed.

If you remember from the overview post, there are a couple of things that the DefinitionCollector needs to keep track of - the current namespace, as well as any names imported by a use statement.

We can define the new structure like so:

use pxp_parser::lexer::byte_string::ByteString;

#[derive(Debug)]
pub struct DefinitionCollector {
    current_namespace: ByteString,
    imported_names: Vec<ByteString>,
}

Note: a small implementation detail here is that the parser doesn't store strings as String or &str. PHP source code doesn't need to be valid UTF-8. To work around this, the parser uses a custom ByteString structure that stores characters in a Vec<u8>.
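To see why UTF-8 strings won't do, consider a file containing a Latin-1 encoded string literal. A quick sketch:

```rust
fn main() {
    // PHP source is free to contain arbitrary bytes. 0xE9 is "é" in
    // Latin-1, but it is not a valid UTF-8 sequence on its own.
    let source: Vec<u8> = b"<?php echo \"caf\xE9\";".to_vec();

    // A Rust String refuses these bytes because they aren't UTF-8...
    assert!(String::from_utf8(source.clone()).is_err());

    // ...but a plain byte buffer (which is what ByteString wraps) is fine.
    assert_eq!(source.len(), 18);
}
```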

The first method the collector needs is a constructor, typically called new.

impl DefinitionCollector {
    pub fn new() -> Self {
        Self {
            current_namespace: ByteString::default(),
            imported_names: Vec::new(),
        }
    }
}

We'll initialise the current_namespace to an empty ByteString; this way we don't need to worry about unwrapping an Option and can always safely prefix a name with an empty string. The imported_names field will be empty to start with too.

While we're on the topic of constructors, we might as well create a new instance of the DefinitionCollector in the command's function and store it in a mutable variable.

use statan::definitions::collector::DefinitionCollector;

use crate::AnalyseCommand;

pub fn run(command: AnalyseCommand) {
    let files = discoverer::discover(&["php"], &["."]).unwrap();
    let mut collector = DefinitionCollector::new();

    for file in files {
        let contents = std::fs::read(&file).unwrap();
        let ast = pxp_parser::parse(&contents).unwrap();
    }
}

To actually do the collection, we'll add a scan() method to the collector as well and pass in a mutable reference to the AST. The reference needs to be mutable since that is what the parser's visitor API expects.

impl DefinitionCollector {
    pub fn new() -> Self {
        Self {
            current_namespace: ByteString::default(),
            imported_names: Vec::new(),
        }
    }

    pub fn scan(&mut self, ast: &mut Vec<Statement>) {
        
    }
}

Back in the command, we can call scan() for each parsed file:

pub fn run(command: AnalyseCommand) {
    let files = discoverer::discover(&["php"], &["."]).unwrap();
    let mut collector = DefinitionCollector::new();

    for file in files {
        let contents = std::fs::read(&file).unwrap();
        let mut ast = pxp_parser::parse(&contents).unwrap();
        
        collector.scan(&mut ast);
    }
}

Next up is traversing the given AST. The parser provides a Visitor trait that we can implement on the DefinitionCollector; the scan() method will just defer everything to the visitor.

impl Visitor<()> for DefinitionCollector {
    fn visit(&mut self, node: &mut dyn Node) -> Result<(), ()> {
        Ok(())
    }
}

If you're paying attention to the code, you'll see that the visit() method receives a &mut dyn Node. The parser has a special Node API that is used for various types of nodes in the AST, including statements and expressions. The benefit of using this API is that we don't need to write individual methods to handle Statement, Expression, Identifier, etc. We can use a single method and cast the node to a specific form when necessary.

We first want to check if the current node is a Namespace. We can use the parser's downcast() function to handle this, passing in a generic parameter as the target type for the cast.

impl Visitor<()> for DefinitionCollector {
    fn visit(&mut self, node: &mut dyn Node) -> Result<(), ()> {
        if let Some(BracedNamespace { name: Some(SimpleIdentifier { value, .. }), .. }) = downcast::<BracedNamespace>(node) {
            let mut namespace = ByteString::from(b"\\");
            namespace.extend(&value.bytes);
            self.current_namespace = namespace;
        } else if let Some(UnbracedNamespace { name: SimpleIdentifier { value, .. }, .. }) = downcast::<UnbracedNamespace>(node) {
            let mut namespace = ByteString::from(b"\\");
            namespace.extend(&value.bytes);
            self.current_namespace = namespace;
        }

        Ok(())
    }
}

We can take advantage of Rust's deep pattern matching capabilities to reach down into the fields on these structures and reduce some of the nested if statements that would otherwise be required. We also need to check against 2 different types of Node here since PHP has 2 ways of defining a namespace.

// Unbraced syntax
namespace Foo;

// Braced syntax
namespace Foo {
    // ...
}

With the brace syntax, it's also possible to omit the namespace's name altogether and explicitly scope the block to the global namespace. If that wasn't the case, we could match against a single pattern instead.

To make the namespaces fully-qualified, we also need to prepend a single \ to the name.

PHP also has a couple of ways to import names.

// Single use statement.
use Foo\Bar;

// Multi-use statement.
use Foo\Bar, Foo\Baz;

// Grouped use statement.
use Foo\{Bar, Baz};

You can also specifically import function and const names, but the definition collector doesn't really care about those, since it's not possible to use them as parameter types, return types, etc. The variance in syntax does mean we need to handle a couple of different statement types again.

if let Some(GroupUseStatement { prefix, uses, .. }) = downcast::<GroupUseStatement>(node) {
    // ...           
}

if let Some(UseStatement { uses, .. }) = downcast::<UseStatement>(node) {
    // ...
}

Both of these statements contain a Vec<Use>; the only extra bit of logic needed for GroupUseStatement is prefixing. It should be easy to start putting names into the list of imported_names.

if let Some(GroupUseStatement { prefix, uses, .. }) = downcast::<GroupUseStatement>(node) {
    for Use { name, .. } in uses {
        let mut prefixed_name = prefix.value.clone();
        prefixed_name.extend(b"\\");
        prefixed_name.extend(&name.value.bytes);

        self.imported_names.push(prefixed_name);
    }
}

if let Some(UseStatement { uses, .. }) = downcast::<UseStatement>(node) {
    for Use { name, .. } in uses {
        let mut qualified_name = ByteString::from(b"\\");
        qualified_name.extend(&name.value.bytes);
        self.imported_names.push(qualified_name);
    }
}

There are a couple of ways we could structure the definitions themselves. We could create a Definition enumeration with a variant for each type of definition, or we could create separate structures for each type of definition and store them in separate fields.

I personally prefer the second option, since it should improve lookup time by reducing the number of items that need to be filtered through. Let's start with the simplest definition, a FunctionDefinition.

#[derive(Debug, Clone)]
pub struct FunctionDefinition {
    pub name: ByteString,
    pub parameters: Vec<Parameter>,
    pub return_type: Option<Type>,
}

Functions have parameters, so we'll need a structure for that too.

#[derive(Debug, Clone)]
pub struct Parameter {
    pub name: ByteString,
    pub type_: Option<Type>,
}

We also need a Type enumeration that will represent all the types the analyser understands. The parser does have its own Type enum too, but it won't support everything we need at this point.

#[derive(Debug, Clone)]
pub enum Type {
    String,
    Int,
    Float,
    Array,
    Mixed,
    Bool,
    Named(ByteString),
}

The Type enumeration will only support PHP's most basic types for now. Adding support for unions, intersections and more importantly generics will be easy to do later on.

We'll also need somewhere to store the definitions. Let's create a DefinitionCollection.

#[derive(Debug, Default, Clone)]
pub struct DefinitionCollection {
    functions: Vec<FunctionDefinition>,
}
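For contrast, the enum-based alternative we passed on would mix every kind of definition into one list, forcing each lookup to filter past unrelated variants. A rough sketch, with entirely hypothetical names:

```rust
// Hypothetical sketch of the rejected single-enum design.
#[derive(Debug)]
enum Definition {
    Function { name: String },
    Class { name: String },
}

fn find_function<'a>(defs: &'a [Definition], target: &str) -> Option<&'a Definition> {
    // Every lookup has to scan past unrelated definition kinds.
    defs.iter()
        .find(|d| matches!(d, Definition::Function { name } if name == target))
}

fn main() {
    let defs = vec![
        Definition::Class { name: "\\App\\User".into() },
        Definition::Function { name: "\\strlen".into() },
    ];
    assert!(find_function(&defs, "\\strlen").is_some());
    // "\App\User" exists, but it's a class, not a function.
    assert!(find_function(&defs, "\\App\\User").is_none());
}
```

With separate fields per kind, a function lookup only ever touches the functions list.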

The DefinitionCollector can now hold a single DefinitionCollection and we'll start to add some items to it.

#[derive(Debug)]
pub struct DefinitionCollector {
    current_namespace: ByteString,
    imported_names: Vec<ByteString>,
    collection: DefinitionCollection,
}

impl DefinitionCollector {
    pub fn new() -> Self {
        Self {
            current_namespace: ByteString::default(),
            imported_names: Vec::new(),
            collection: DefinitionCollection::default(),
        }
    }

    // ...
}

To make the API a little nicer, we can define an add_function() method on the collection.

impl DefinitionCollection {
    pub fn add_function(&mut self, function: FunctionDefinition) {
        self.functions.push(function);
    }
}

In the collector, we need to check if the current node is a FunctionStatement. If it is, we can start to add information to a FunctionDefinition and store it inside of the collection.

if let Some(FunctionStatement { name, parameters, return_type, .. }) = downcast::<FunctionStatement>(node) {
    let name = self.qualify_name(&name.value);
    let parameters = parameters.parameters.inner.iter().map(|p| {
        Parameter {
            name: p.name.name.clone(),
            type_: self.map_type(p.data_type.as_ref()),
        }
    }).collect::<Vec<Parameter>>();
    let return_type = if let Some(ReturnType { data_type, .. }) = return_type {
        self.map_type(Some(data_type))
    } else {
        None
    };

    self.collection.add_function(FunctionDefinition { name, parameters, return_type })
}

We first convert the function's name into a qualified name. The list of function parameters are then converted into Parameter values. If the function has a return type, we also map that into the analyser's own Type enumeration.

A set of additional utility methods have been added as well - qualify_name(), resolve_name() and map_type().

qualify_name() simply prepends the given string with the current namespace, producing a fully-qualified name.

fn qualify_name(&self, name: &ByteString) -> ByteString {
    let mut qualified_name = self.current_namespace.clone();
    qualified_name.extend(b"\\");
    qualified_name.extend(&name.bytes);

    qualified_name
}

resolve_name() is a little more complex and handles name resolution. This isn't as simple as doing a lookup on the list of imported_names, since PHP supports a few different ways of referencing names.

You can import a single named item and reference it:

use Closure;

function invoke(Closure $closure) {}

You can also reference a fully-qualified name:

function invoke(\Closure $closure) {}

You can even import a namespace and use a qualified name.

use App\Models;

function do_something(Models\User $user) {}

And finally, if no imported name matches the given identifier, we assume you're referencing something in the current namespace. The resolve_name() method handles all of these cases.

fn resolve_name(&self, name: &ByteString) -> ByteString {
    // If the name is already fully qualified, return as is.
    if name.bytes.starts_with(b"\\") {
        return name.clone();
    }

    let parts = name.split(|b| *b == b'\\').collect::<Vec<&[u8]>>();
    let first_part = parts.first().unwrap();

    // Check each imported name to see if it ends with the first part of the
    // given identifier. If it does, we can assume you're referencing either
    // an imported namespace or an imported class.
    for imported_name in self.imported_names.iter() {
        if imported_name.ends_with(&first_part) {
            let mut qualified_name = imported_name.clone();
            qualified_name.extend(&name.bytes[first_part.len()..]);

            return qualified_name;
        }
    }

    // If we've reached this point, we have a simple name that
    // is not fully qualified and we have not imported it.
    // We can simply prepend the current namespace to it.
    let mut qualified_name = self.current_namespace.clone();
    qualified_name.extend(b"\\");
    qualified_name.extend(&name.bytes);

    qualified_name
}
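As a sanity check, the same resolution rules can be sketched with plain strings standing in for ByteString. This is a simplified model of the logic, not the collector's actual code:

```rust
// Simplified stand-in for resolve_name(), using &str instead of ByteString.
fn resolve_name(current_namespace: &str, imported: &[&str], name: &str) -> String {
    // Already fully qualified: return as-is.
    if name.starts_with('\\') {
        return name.to_string();
    }
    let first_part = name.split('\\').next().unwrap();
    // An import whose last segment matches the first part of the name.
    for import in imported {
        if import.ends_with(first_part) {
            return format!("{}{}", import, &name[first_part.len()..]);
        }
    }
    // Otherwise, fall back to the current namespace.
    format!("{}\\{}", current_namespace, name)
}

fn main() {
    let imports = ["\\Closure", "\\App\\Models"];
    // use Closure; ... Closure $closure
    assert_eq!(resolve_name("\\App", &imports, "Closure"), "\\Closure");
    // Models\User resolves through the imported App\Models namespace.
    assert_eq!(resolve_name("\\App", &imports, "Models\\User"), "\\App\\Models\\User");
    // \Closure is already fully qualified.
    assert_eq!(resolve_name("\\App", &imports, "\\Closure"), "\\Closure");
    // An unimported simple name falls back to the current namespace.
    assert_eq!(resolve_name("\\App", &imports, "Foo"), "\\App\\Foo");
}
```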

The final utility function, map_type(), takes an Option<ParsedType> and converts it into the analyser's own Type enumeration.

match data_type {
    Some(t) => Some(match t {
        ParsedType::Named(_, name) => Type::Named(self.resolve_name(name)),
        ParsedType::Float(_) => Type::Float,
        ParsedType::Boolean(_) => Type::Bool,
        ParsedType::Integer(_) => Type::Int,
        ParsedType::String(_) => Type::String,
        ParsedType::Array(_) => Type::Array,
        ParsedType::Mixed(_) => Type::Mixed,
        _ => todo!(),
    }),
    None => None,
}

Note: There's plenty of room for optimisation in these methods but they serve their purpose right now and will allow us to move on. It's a waste of time getting bogged down in optimisation tricks this early in a project.

Now that we have some functions being stored in the collection, it makes sense to add a collect() method to the collector that returns the DefinitionCollection we've been working with.

pub fn collect(&self) -> DefinitionCollection {
    self.collection.clone()
}

Instead of messing with lifetimes and references right now, we'll use .clone(). It's not a difficult task to come back and change this later on to improve performance and memory usage.

We also need to update the scan() method to actually start traversing the AST. We can loop over the AST and pass each statement along to the Visitor implementation.

pub fn scan(&mut self, ast: &mut Vec<Statement>) {
    self.current_namespace = ByteString::default();
    self.imported_names = Vec::new();

    for statement in ast.iter_mut() {
        self.visit_node(statement).unwrap();
    }
}

It's important that we reset the current_namespace and imported_names. The DefinitionCollector won't be unique to a file so we need to reset the state between scans.

If we create two separate PHP files, a.php and b.php, each containing a function named after its file, and run the analyser against just one of those files, we can dump out the collection and see that both functions were discovered (the discovery step searches the whole directory, regardless of the path passed to the command).

// a.php
use App\Foo;

function a(Foo $foo) {}

// b.php
use App\Models;

function b(Models\User $user, \Closure $closure) {}

cargo run -- analyse ./playground/a.php

We get the following output from the debug dump:

[src/cmd/analyse.rs:17] collection = DefinitionCollection {
    functions: [
        FunctionDefinition {
            name: "\b",
            parameters: [
                Parameter {
                    name: "$user",
                    type_: Some(
                        Named(
                            "\App\Models\User",
                        ),
                    ),
                },
                Parameter {
                    name: "$closure",
                    type_: Some(
                        Named(
                            "\Closure",
                        ),
                    ),
                },
            ],
            return_type: None,
        },
        FunctionDefinition {
            name: "\a",
            parameters: [
                Parameter {
                    name: "$foo",
                    type_: Some(
                        Named(
                            "\App\Foo",
                        ),
                    ),
                },
            ],
            return_type: None,
        },
    ],
}

Exactly what we expected! There are two functions in the collection and all of the parameters have been picked up correctly.

The name resolution is working too. Names that reference an imported namespace are fully qualified, explicitly imported names are resolved, and names that were already fully qualified are left untouched.

The next thing to do is replicate the logic above and do the same for classes, traits, interfaces and enumerations. It's pretty repetitive so I won't include the code in this post, but if you're interested in what the result looks like, all of the code for this part is available on GitHub.

Enjoyed this post or found it useful? Please consider sharing it on Twitter.