Architecture
OpenVet uses a microservice architecture, which lets it split responsibilities across components and distribute them over multiple machines. The high-level architecture looks like this:
We will break these components down and explain in more depth what they do.
Frontend
The frontend is the main user interface of OpenVet. It is where end-users can browse crates, crate versions, crate version contents (files), audit results, audit comments, and reports generated by the analyzer.
For end-users, the purpose of the frontend is to be able to assess quickly whether or not a crate is safe, and what the risk factors are.
Auditors can use it to create accounts, start audits, and comment on files, modules, and functions as they perform their audit. By design, anything auditors do is public: events are recorded in an append-only log.
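As a rough illustration of what "append-only" means here, the following is a minimal in-memory sketch. The names `Action`, `Event`, and `EventLog`, and the particular event variants, are assumptions made for this example and are not taken from OpenVet's code; a real log would be persisted by the storage system.

```rust
/// Kinds of auditor actions recorded in the log (hypothetical variants).
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    AccountCreated,
    AuditStarted { krate: String, version: String },
    CommentAdded { path: String, text: String },
}

/// One recorded event. `seq` is assigned by the log, never by the caller.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub seq: u64,
    pub auditor: String,
    pub action: Action,
}

/// Append-only log: the entries are private, and the only mutating
/// method adds to the end. Nothing can edit or remove an event.
#[derive(Debug, Default)]
pub struct EventLog {
    entries: Vec<Event>,
}

impl EventLog {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record an action and return its sequence number.
    pub fn append(&mut self, auditor: &str, action: Action) -> u64 {
        let seq = self.entries.len() as u64;
        self.entries.push(Event {
            seq,
            auditor: auditor.to_string(),
            action,
        });
        seq
    }

    /// Read-only view of all events, oldest first.
    pub fn events(&self) -> &[Event] {
        &self.entries
    }
}
```

Because the type exposes no update or delete operation, the public history can only grow, which is the property the design relies on.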
For an auditor, the purpose of the frontend is to provide a quick summary of the crate's risk factors and to make the code easy to read. Feature ideas are:
- Ability to navigate a codebase easily, for example clicking on a function call takes you to its definition.
- Ability to inspect autogenerated code, for example by expanding macro invocations, build script outputs, procedural macro output.
- Ability to highlight code that should be looked at more closely, for example unsafe code, uses of unwrap(), usage of certain APIs (process spawning).
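The highlighting idea in the last item can be sketched as a simple line scanner. This is only an illustration under stated assumptions: the `Highlight` type, the `PATTERNS` table, and the `highlight` function are hypothetical names, and plain substring matching is a stand-in for what would more likely be analysis of the parsed syntax tree.

```rust
/// A line that an auditor may want to look at more closely.
#[derive(Debug, PartialEq)]
pub struct Highlight {
    /// 1-based line number in the source file.
    pub line: usize,
    /// Why the line was flagged.
    pub reason: &'static str,
}

/// Hypothetical pattern table: substring to look for, and why it matters.
const PATTERNS: &[(&str, &str)] = &[
    ("unsafe", "unsafe code"),
    (".unwrap()", "use of unwrap()"),
    ("Command::new", "process spawning"),
];

/// Scan source text line by line and collect the lines matching a pattern.
/// Lines that start with `//` are skipped so comments do not trigger hits.
pub fn highlight(source: &str) -> Vec<Highlight> {
    let mut out = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        let code = line.trim_start();
        if code.starts_with("//") {
            continue;
        }
        for &(needle, reason) in PATTERNS {
            if code.contains(needle) {
                out.push(Highlight { line: idx + 1, reason });
            }
        }
    }
    out
}
```

A substring match will produce false positives (for example an identifier that merely contains the word unsafe), so it is best read as a hint generator rather than a verdict.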
Sync
The purpose of the sync component is to keep OpenVet's idea of which crates exist aligned with what is actually on crates.io. There is currently no way to receive push-style change notifications from crates.io, only ways to poll. There are three ways to access data from crates.io:
- Poll the crates.io API
- Periodically fetch the crates.io database dumps
- Poll the crates.io git index
The current implementation uses the git-index strategy, because it was the easiest to implement. Relying on the database dumps might be advantageous, as they contain more information (such as historical download counts) that is otherwise difficult to access.
Using the crates.io API directly is not a viable strategy, due to the number of requests that would need to be made.
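Whichever source is polled, the core of a sync pass is a comparison between what OpenVet already knows and what upstream reports. The sketch below shows that comparison on in-memory snapshots. The `Snapshot` alias and the `record` and `new_versions` functions are assumptions for this example; reading the git index itself is deliberately left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Crate name -> set of published versions, as seen at one point in time.
pub type Snapshot = BTreeMap<String, BTreeSet<String>>;

/// Add one (crate, version) pair to a snapshot.
pub fn record(snapshot: &mut Snapshot, name: &str, version: &str) {
    snapshot
        .entry(name.to_string())
        .or_default()
        .insert(version.to_string());
}

/// Return every (crate, version) pair that is present in `upstream`
/// but missing from `known`, sorted by crate name and then by version string.
pub fn new_versions(known: &Snapshot, upstream: &Snapshot) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for (name, versions) in upstream {
        let seen = known.get(name);
        for version in versions {
            // A version is new if the crate is unknown or the version is.
            if seen.map_or(true, |s| !s.contains(version)) {
                out.push((name.clone(), version.clone()));
            }
        }
    }
    out
}
```

Each poll would then build a fresh upstream snapshot, call `new_versions`, and register the returned pairs so that later components can pick them up.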
Backend
The backend has two main purposes: to provide an API for getting data from the database, and to serve the frontend for browsers.
The API can be used to query any data stored in the database. It provides file-based access to crate contents, various kinds of crate metadata, audit results, and events.
It also provides git-based access to crate contents, allowing crates to be cloned with git. The backend exposes all published crate versions as a virtual git repository. This is a convenience for developers, who can inspect crate contents locally using their own toolchains.
The backend should make use of caching, both in memory and on disk, to serve popular files quickly without making too many requests to storage.
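A minimal sketch of the in-memory half of such a cache is shown below. It is a read-through cache with a fixed capacity and first-in, first-out eviction. The `ReadThroughCache` type, its `misses` counter, and the choice of FIFO eviction are all assumptions for illustration; the `fetch` closure stands in for a request to storage.

```rust
use std::collections::{HashMap, VecDeque};

/// Read-through cache in front of a slower `fetch` function.
pub struct ReadThroughCache<F> {
    fetch: F,
    capacity: usize,
    entries: HashMap<String, Vec<u8>>,
    /// Keys in insertion order, oldest first, used for eviction.
    order: VecDeque<String>,
    /// Number of lookups that had to go to `fetch`.
    pub misses: u64,
}

impl<F: FnMut(&str) -> Option<Vec<u8>>> ReadThroughCache<F> {
    pub fn new(capacity: usize, fetch: F) -> Self {
        Self {
            fetch,
            capacity,
            entries: HashMap::new(),
            order: VecDeque::new(),
            misses: 0,
        }
    }

    /// Return the cached value, or fetch it and remember it.
    pub fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(hit) = self.entries.get(key) {
            return Some(hit.clone());
        }
        self.misses += 1;
        let value = (self.fetch)(key)?;
        if self.capacity > 0 {
            // Evict the oldest entry once the cache is full.
            if self.entries.len() >= self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.entries.remove(&oldest);
                }
            }
            self.entries.insert(key.to_string(), value.clone());
            self.order.push_back(key.to_string());
        }
        Some(value)
    }
}
```

FIFO eviction keeps the example short. A policy that tracks recency or popularity would likely suit the stated goal of serving popular files better.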
Finally, the backend is responsible for serving the frontend. This involves serving all of the frontend's assets, but it also means using server-side rendering to pre-render pages, which allows search engines and clients that do not support WebAssembly to interact with the platform.
Analyzer
The purpose of the analyzer is to perform analysis on crate contents, with a focus on detecting potential issues. Some of the issues it might detect are:
- Use of unsafe
- Crate lib name does not match crate name
- Linking with native libraries
- Usage of a build script
- Usage of proc macros
- Usage of FFI
- Capabilities (network and file I/O, manual allocation, MMAP)
It will poll for crates that have not yet been analyzed, perform the analysis, and push the results to storage. These analysis results are intended as hints for crate auditors, and allow end-users to see potential issues before a crate has been thoroughly audited.
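To make the shape of these results concrete, here is a small sketch that turns already-extracted facts about a crate into a list of findings. The `CrateFacts` and `Finding` types and the `analyze` function are hypothetical, and the sketch covers only some of the checks listed above. Extracting the facts from the manifest and sources is assumed to happen elsewhere.

```rust
/// Facts about one crate version, assumed to be extracted beforehand.
#[derive(Debug, Default)]
pub struct CrateFacts {
    pub name: String,
    /// Library target name, if the crate has a library target.
    pub lib_name: Option<String>,
    pub has_build_script: bool,
    pub is_proc_macro: bool,
    /// Native library the crate declares it links against, if any.
    pub links: Option<String>,
    /// Number of `unsafe` occurrences found in the sources.
    pub unsafe_count: usize,
}

/// One hint for auditors and end-users.
#[derive(Debug, PartialEq)]
pub enum Finding {
    UsesUnsafe { occurrences: usize },
    LibNameMismatch { expected: String, actual: String },
    LinksNativeLibrary(String),
    HasBuildScript,
    IsProcMacro,
}

/// Derive findings from the facts, in the order the checks are listed.
pub fn analyze(facts: &CrateFacts) -> Vec<Finding> {
    let mut findings = Vec::new();
    if facts.unsafe_count > 0 {
        findings.push(Finding::UsesUnsafe { occurrences: facts.unsafe_count });
    }
    // Cargo's default library name is the package name with '-' replaced by '_'.
    let expected = facts.name.replace('-', "_");
    if let Some(actual) = &facts.lib_name {
        if *actual != expected {
            findings.push(Finding::LibNameMismatch { expected, actual: actual.clone() });
        }
    }
    if let Some(lib) = &facts.links {
        findings.push(Finding::LinksNativeLibrary(lib.clone()));
    }
    if facts.has_build_script {
        findings.push(Finding::HasBuildScript);
    }
    if facts.is_proc_macro {
        findings.push(Finding::IsProcMacro);
    }
    findings
}
```

An empty result does not mean a crate is safe. It only means none of these particular hints fired.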
Storage
The storage system is responsible for dealing with persistence. None of the other components persist anything, aside from potentially caching data.
The most important properties of the storage system are that it is simple and reasonably secure. It is not expected to scale massively, because the expected user count of OpenVet is small.
The choice of SQLite for the database achieves simplicity: it is easy to deploy and upgrade. It has a further advantage: because it runs in-process, it is easy to extend. The storage system uses this to introduce type-checking into the database schema, backed by Rust types. This makes it easier to change the schema or the types, because we can use SQLite's built-in functionality to re-check constraints. For recovery purposes, the database is periodically backed up to object storage.
Object storage is used to store crate contents. This is a lot of data, because OpenVet stores crate contents as individual files. We make use of content-addressing to get deduplication for free: identical files are stored only once. Generally, the storage system expects an S3-compatible object store, because there are many providers for it. Due to the underlying implementation, however, it is easy to swap it out for a different protocol.
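The deduplication effect of content-addressing can be seen in a small in-memory sketch. The `ContentStore` type and its methods are hypothetical. To stay dependency-free the example derives keys with the non-cryptographic FNV-1a hash, which is only a stand-in: a real content-addressed store would use a cryptographic hash such as SHA-256.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit hash. Stand-in only: not collision-resistant,
/// a real store would use a cryptographic hash.
fn fnv1a(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Objects are stored under a key derived from their own bytes.
#[derive(Default)]
pub struct ContentStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ContentStore {
    /// Store `data` and return its key. Storing the same bytes again
    /// returns the same key and does not create a second object.
    pub fn put(&mut self, data: &[u8]) -> String {
        let key = format!("{:016x}", fnv1a(data));
        self.objects
            .entry(key.clone())
            .or_insert_with(|| data.to_vec());
        key
    }

    /// Look an object up by its key.
    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.objects.get(key).map(|v| v.as_slice())
    }

    /// Number of distinct objects currently stored.
    pub fn object_count(&self) -> usize {
        self.objects.len()
    }
}
```

Files that recur across many crate versions, such as license texts, would map to a single object under this scheme, which is where the savings come from.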