Ryan Chandler

Blazingly Fast Markdown Parsing in PHP using FFI and Rust

Parsing and rendering Markdown as HTML is very common on the modern web. The content found on this very blog is written in Markdown and rendered to HTML using the league/commonmark package and a few plugins.

I'm currently in the process of writing a book to accompany a video course. All of the content for that book is also written in Markdown, but I've started to notice a slowdown in build speed as the number and size of the files increase. The biggest bottleneck in the process is in fact the conversion of Markdown to HTML.

In this post, I'll go through the steps that I took to bind the excellent comrak Rust crate to my existing PHP build process via FFI to improve the performance of my book-building tool.

What is FFI?

FFI stands for "Foreign Function Interface". When a language provides some form of FFI, it allows you to interact with functions written in a totally different programming language.

There are plenty of practical applications for FFI, for example:

  • Progressively migrating from one language to another
  • Improving mission-critical code with more performant code
  • Using packages written in other languages

In this case, we'll be using a package from another language to improve the performance of some PHP code.

The FFI class

PHP introduced support for FFI as part of the PHP 7.4 release with the addition of a new FFI class in the global namespace. The idea here is that we compile some Rust code into a shared library (.so, .dylib, .dll) and provide a C-style header declaration to let PHP know which structures and definitions are available as part of our code.

Generating a shared library

Let's say that we want to expose a Rust function that accepts 2 integers and returns the sum of those numbers.

fn add(a: i64, b: i64) -> i64 {
    return a + b;
}

To represent this Rust function in the style of a C function, we would need to define the signature of the function using C syntax.

long long add(long long a, long long b);

Assuming that the Rust project was created using cargo init --lib, we'll have a src/lib.rs file that will contain our Rust code. If the project is built using cargo build, the Rust compiler will produce an rlib by default. This is a special library format specific to Rust and therefore not usable with FFI.

We need to tell the Rust compiler that we want to build a shared library. The way to do this is by adding a crate-type configuration key to the Cargo.toml file in the Rust project.

[package]
name = "php-comrak"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]

The cdylib format will produce the shared library in a valid C-style format for most (if not all) FFI APIs. The generated file can be found in target/debug as something similar to libcratename.dylib (or .so and .dll on other platforms). Cargo replaces hyphens in the crate name with underscores, so the php-comrak crate is built as libphp_comrak.

Although we have written a Rust function, it's not currently being exposed via the shared library. To expose the function, we need to tell Rust that we wish to make it public and export it as a C-compatible function using the extern keyword.

pub extern "C" fn add(a: i64, b: i64) -> i64 {
    return a + b;
}

We can still use Rust's i64 integer type since it knows how to compile that to C's long long equivalent.
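As a small sketch of that type mapping: the standard library ships C-named aliases in std::ffi, and c_longlong is the one that corresponds to C's long long. The version of add below is the same function spelled with that alias, and long_long_matches_i64 is a made-up helper name used only to check the widths line up on the machine you're building on.

```rust
use std::ffi::c_longlong;
use std::mem::size_of;

/// The same `add` function, written with the C-named alias for `long long`.
pub extern "C" fn add(a: c_longlong, b: c_longlong) -> c_longlong {
    a + b
}

/// Hypothetical helper: confirms that `i64` and `c_longlong` have the same width.
pub fn long_long_matches_i64() -> bool {
    size_of::<c_longlong>() == size_of::<i64>()
}
```

Using i64 directly, as the rest of this post does, reads more naturally in Rust; the alias is mainly useful when you want the Rust signature to mirror the C declaration word for word.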

Now let's hook it up to PHP using FFI::cdef().

$ffi = FFI::cdef('long long add(long long a, long long b);', __DIR__ . '/target/debug/libphp_comrak.dylib');

echo $ffi->add(100, 200);

The first argument is the C declaration of the function and the second argument is the path to the shared library.

Running that PHP file in a terminal produces the following output:

Fatal error: Uncaught FFI\ParserException: ';' expected, got '<EOF>' at line 1 in /.../php-comrak/main.php:3
Stack trace:
#0 /.../php-comrak/main.php(3): FFI::cdef('long long add(l...', '/...')
#1 {main}
  thrown in /.../php-comrak/main.php on line 3

What happened? The Rust file correctly defines that function and the header describes the function signature...

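The culprit here is name mangling. Name mangling is a process that a compiler uses to give every function a unique symbol name and avoid conflicts during linking. Rust mangles names by default, so the library doesn't contain a symbol called plain add for PHP to find. To prevent the name of the add function from being mangled, an attribute needs to be added to the function itself.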

#[no_mangle]
pub extern "C" fn add(a: i64, b: i64) -> i64 {
    return a + b;
}

Re-compiling the Rust code and re-executing the PHP code should now work.

$ cargo build
$ php ./main.php
300

Adding Comrak

The first thing to do is add the crate as a dependency of the project.

$ cargo add comrak

Next we can define a new function called compile which will accept a string of Markdown and compile it into HTML.

pub extern "C" fn compile(markdown: ?) -> ? {
    // ...
}

But what type do we accept and return? This is where it starts to get more involved.

If this were a regular Rust function we'd probably use one of Rust's first-party string representations, String and &str. The issue with those types is that they're not FFI-safe. Thankfully Rust provides some extra types as part of the std::ffi module.

Instead of String, we can use the std::ffi::c_char type, accepting and returning a raw pointer (*const c_char).

use std::ffi::c_char;

#[no_mangle]
pub extern "C" fn compile(markdown: *const c_char) -> *const c_char {
    // ...
}

To see if the code is working, we can just return the markdown value directly and then update our PHP code to call the new function.

#[no_mangle]
pub extern "C" fn compile(markdown: *const c_char) -> *const c_char {
    return markdown;
}

$ffi = FFI::cdef('const char* compile(const char* markdown);', __DIR__ . '/target/debug/libphp_comrak.dylib');

$markdown = <<<'md'
## Hello, world!

Here is some **bold** and _italic_ text.
md;

echo $ffi->compile($markdown);

Compiling and executing produces the following:

$ cargo build
$ php ./main.php
## Hello, world!

Here is some **bold** and _italic_ text.%

Calling the compile() function returns a const char* (a pointer to a constant character), and PHP's FFI handling is smart enough to convert that into a regular PHP string for us.

Now that we know the FFI part is working, let's handle the conversion of a *const c_char into a Rust &str type.

let markdown = unsafe { CStr::from_ptr(markdown) }
    .to_str()
    .unwrap();

The use of an unsafe block is crucial here. Since we're using a raw pointer (*const c_char), Rust can't provide its usual compile-time guarantees regarding memory safety. We're admitting to that by using the unsafe block to construct a std::ffi::CStr.
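The snippet above will panic if the bytes aren't valid UTF-8, and it assumes the pointer is never null. A more defensive variant might look something like the sketch below. The helper name borrow_c_str is invented for this example, and it assumes that reporting failure as None is acceptable to the caller.

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical helper: borrows a C string as `&str` without panicking.
/// Returns `None` for a null pointer or for bytes that are not valid UTF-8.
///
/// # Safety
/// A non-null `ptr` must point to a NUL-terminated string that stays alive
/// and unmodified for the lifetime `'a`.
pub unsafe fn borrow_c_str<'a>(ptr: *const c_char) -> Option<&'a str> {
    if ptr.is_null() {
        return None;
    }
    // SAFETY: the caller guarantees `ptr` is a valid NUL-terminated string.
    let c_str = unsafe { CStr::from_ptr(ptr) };
    // `to_str()` validates UTF-8; `ok()` turns the error case into `None`.
    c_str.to_str().ok()
}
```

The null check matters because CStr::from_ptr() has no way to detect a null pointer itself; passing one is undefined behaviour rather than a catchable error.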

Comrak provides a very simple markdown_to_html() function that accepts the &str stored inside of markdown, as well as some configuration options.

let html = markdown_to_html(markdown, &ComrakOptions::default());

There's no need to worry about the options right now; we'll use the sensible defaults that Comrak provides.

Now that we have a String containing the HTML, we need to convert it back into a *const c_char. That can be done using the CString structure.

CString is to CStr as String is to &str. CString represents an owned C-compatible string, whereas CStr represents a borrowed reference to an array of characters (bytes).

return CString::new(html).unwrap().into_raw();

The CString::new() constructor fails if the input contains a NUL byte anywhere inside it, so the result needs to be unwrapped first. Once it is, we can grab a raw pointer to the underlying data and return it directly.
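One thing worth spelling out: into_raw() hands ownership of the allocation to the caller, so Rust will never free it on its own. A common pattern, sketched here under the assumption that the caller passes the pointer back once it's finished with it, is to pair the conversion with a function that reclaims the string via CString::from_raw(). The names into_c_string and free_string are invented for this example and aren't part of the code in this post.

```rust
use std::ffi::{c_char, CString};
use std::ptr;

/// Hypothetical helper: converts a Rust `String` into a raw C string.
/// Returns a null pointer instead of panicking if `text` contains a NUL byte.
pub fn into_c_string(text: String) -> *mut c_char {
    match CString::new(text) {
        Ok(c_string) => c_string.into_raw(), // ownership moves to the caller
        Err(_) => ptr::null_mut(),
    }
}

/// Hypothetical companion function: gives the allocation back to Rust to be freed.
/// To export it from the library it would need the same attribute as `compile`.
///
/// # Safety
/// `ptr` must be null or a pointer previously returned by `into_c_string`,
/// and it must not be used again after this call.
pub unsafe extern "C" fn free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: the pointer came from `CString::into_raw`, per the contract above.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

For a short-lived build script the leak is unlikely to matter, but in a long-running process every call would otherwise leave its HTML string allocated. How the PHP side holds on to the pointer long enough to pass it back is left out of this sketch.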

Here's the final function:

use std::ffi::{c_char, CStr, CString};
use comrak::{markdown_to_html, ComrakOptions};

#[no_mangle]
pub extern "C" fn compile(markdown: *const c_char) -> *const c_char {
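    // Borrow the incoming C string; this panics if it isn't valid UTF-8.
    let markdown = unsafe { CStr::from_ptr(markdown) }
        .to_str()
        .unwrap();

    // Render the Markdown to HTML using Comrak's default options.
    let html = markdown_to_html(markdown, &ComrakOptions::default());

    // Hand ownership of the HTML to the caller as a raw C string.
    return CString::new(html).unwrap().into_raw();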
}

The PHP code doesn't need to be updated again since it already has the correct signature. All we need to do is re-compile the Rust code and execute the PHP code again.

$ cargo build
$ php ./main.php
<h2>Hello, world!</h2>
<p>Here is some <strong>bold</strong> and <em>italic</em> text.</p>

And there we go! We're parsing and compiling Markdown to HTML with Rust and FFI. Let's do some rudimentary benchmarks to see how it performs compared to league/commonmark.

The benchmark is focused on the actual parsing and compilation time; it doesn't include any file I/O or other irrelevant operations. Each benchmark was executed 10 times and the results you see below are the averages of those 10 runs.
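To make that methodology concrete, here's a rough sketch of an averaging harness. It's an illustration written in Rust, not the PHP script behind the results, and the function name average_seconds and its parameters are invented for this example. It assumes runs is greater than zero.

```rust
use std::time::Instant;

/// Hypothetical harness: calls `work` `iterations` times per run, repeats that
/// for `runs` runs, and returns the mean wall-clock seconds per run.
/// Assumes `runs > 0`; with zero runs the division yields NaN.
pub fn average_seconds<F: FnMut()>(runs: u32, iterations: u32, mut work: F) -> f64 {
    let mut total = 0.0;
    for _ in 0..runs {
        // Only the calls themselves are timed, not any setup around them.
        let start = Instant::now();
        for _ in 0..iterations {
            work();
        }
        total += start.elapsed().as_secs_f64();
    }
    total / f64::from(runs)
}
```

Passing a closure that parses an in-memory string keeps file I/O out of the measurement, which matches the intent described above.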

Iterations   league/commonmark   comrak & FFI
1,000        0.070 seconds       0.012 seconds
10,000       7.250 seconds       0.030 seconds
50,000       DNF                 0.985 seconds

Those benchmarks demonstrate the potential performance benefits of using FFI and a lower-level language at scale.