Building a Toy Container Vulnerability Scanner

After doing some work around container vulnerability scanning recently, I realized that I had a pretty incomplete mental model of how scanners like Snyk or AWS Inspector actually work. A couple questions came up for me, the most pressing being: why can’t I use AWS Inspector to scan the otel/opentelemetry-collector image we use as a base for deploying the OTel collector in prod?

The answer was pretty simple: because the main stage of the collector uses scratch as a base, and Inspector’s ECR integration doesn’t support scratch base images. But I also learned that I don’t know all that much about container scanning and decided it was worth investing some more time to improve that situation. Doing more research, I realized that vulnerability scanning is a surprisingly straightforward process and this was a good opportunity to get some hands-on experience working directly with Docker images.

This post explains what I did to build a “toy” vulnerability scanner, and what I learned. I used Go but my hope is that the code is simple enough for non-Go programmers to follow, and I’ve tried to link to as many sources as possible to make it easier to follow the decisions made. The completed project is on GitHub at jemisonf/toy-scanner.

But first:

Some background

Understanding what actually constitutes a container image¹ is really important to understanding container scanning. If you’re already familiar with image manifests and layers, feel free to skip ahead to the next section.

Container images, as specified in the Open Container Initiative image-spec document, are made up of “layers” of files, with a “manifest” document that describes what layers exist in the image.

You can think of individual layers as just a bundle of files archived with tar and compressed with gzip. Layers correspond roughly 1:1 with the filesystem-changing instructions in your Dockerfile (RUN, COPY, and ADD); the line

ADD package.json package-lock.json .

will create a single layer containing those two files at whatever location they occupy in your container’s filesystem.

When you run a container, Docker applies those layers one at a time to the container filesystem in the order that they were created. So by inspecting each layer in an image, you can see every file that will appear in the container when it’s started.
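To make the "applied in order" idea concrete, here is a minimal sketch that flattens a list of layers into the final filesystem view. The layer type and flatten function are hypothetical stand-ins I made up for illustration (a real layer is a tarball, not a map). It assumes the OCI convention that a file named .wh.<name> is a "whiteout" marking <name> as deleted, and it ignores opaque whiteouts and directories.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// layer is a hypothetical stand-in for an unpacked image layer:
// it maps a file path to that file's contents.
type layer map[string]string

// flatten applies layers in order and returns the resulting filesystem view.
// Later layers overwrite earlier ones, and ".wh.<name>" entries delete
// <name> from the layers beneath them.
func flatten(layers []layer) map[string]string {
	fs := map[string]string{}
	for _, l := range layers {
		// First pass: apply whiteouts, which only affect lower layers.
		for name := range l {
			base := path.Base(name)
			if strings.HasPrefix(base, ".wh.") {
				target := path.Join(path.Dir(name), strings.TrimPrefix(base, ".wh."))
				delete(fs, target)
			}
		}
		// Second pass: add or overwrite regular files from this layer.
		for name, contents := range l {
			if !strings.HasPrefix(path.Base(name), ".wh.") {
				fs[name] = contents
			}
		}
	}
	return fs
}

func main() {
	layers := []layer{
		{"etc/os-release": "VERSION_ID=3.15.0", "etc/motd": "hello"},
		{"etc/os-release": "VERSION_ID=3.15.1", "etc/.wh.motd": ""},
	}

	fs := flatten(layers)
	names := make([]string, 0, len(fs))
	for name := range fs {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s => %s\n", name, fs[name])
	}
}
```

The practical consequence for a scanner is that when the same file shows up in more than one layer, the copy in the last layer is the one the running container actually sees.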

Manifests are just metadata that tells container clients what layers exist for a given image. You don’t need to know a ton about them for the purposes of this post, just that they can tell us how to find the layers for the image we’re scanning. You can see the full specification in the OCI spec.

It’s also important to mention that manifests are just JSON files, and layers are essentially just tarballs. There are software libraries I’ll talk about in the next section that make it substantially easier to interact with container images, but a running Docker daemon isn’t required and even the libraries are optional – it would be possible to implement this toy scanner in a shell script using curl and tar if you were into that kind of thing.
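Since a manifest is just JSON, you can pull the layer digests out of one with nothing but the standard library. This is a rough sketch: the struct names and the layerDigests helper are mine, the sample manifest is hand-written, and the digests in it are shortened placeholders rather than real hashes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// descriptor mirrors the fields of an OCI content descriptor that we care about.
type descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

// manifest holds only the parts of an image manifest needed to locate layers.
type manifest struct {
	SchemaVersion int          `json:"schemaVersion"`
	Config        descriptor   `json:"config"`
	Layers        []descriptor `json:"layers"`
}

// sampleManifest is a hand-written example; the digests are placeholders.
const sampleManifest = `{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:cccc",
    "size": 1024
  },
  "layers": [
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:aaaa", "size": 2048},
    {"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", "digest": "sha256:bbbb", "size": 4096}
  ]
}`

// layerDigests parses a manifest and returns its layer digests in order.
func layerDigests(raw []byte) ([]string, error) {
	var m manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("parsing manifest: %w", err)
	}
	digests := make([]string, 0, len(m.Layers))
	for _, l := range m.Layers {
		digests = append(digests, l.Digest)
	}
	return digests, nil
}

func main() {
	digests, err := layerDigests([]byte(sampleManifest))
	if err != nil {
		fmt.Println(err)
		return
	}
	for i, d := range digests {
		fmt.Printf("layer %d: %s\n", i, d)
	}
}
```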

How do you scan a container for vulnerabilities?

For this section, I wanted to be careful to avoid reinventing the wheel, so I did some research on existing container scanners to see what information is available about their approach to scanning containers. Snyk, a leading container security product, is unfortunately not open source, and I didn’t find readily available resources about how they implement their product. But Clair, developed as part of Red Hat’s Quay project, is open source, and its maintainers have published an overview of its architecture and implementation here: https://quay.github.io/claircore/.

Clair’s architecture has two key components:

Indexers continuously scan for new image layers, and can pull in an image to identify which packages are installed in the image, and then generate a “report” that can be consumed by other Clair components
Matchers, which take reports produced by indexers and compare the packages in the report to public vulnerability databases to identify if a package in a report contains any vulnerabilities

A lot of engineering work in Clair and other projects goes into supporting continuous scanning of container registries and checking for a wide range of operating system and language dependencies. So I added some constraints to make my life a little easier:

The toy scanner should just be a CLI tool that scans a single image on-demand
The scanner will only support Alpine packages

I picked Alpine because my impression from the Clair source code was that Alpine is one of the simpler distributions to scan for installed packages. At the same time, learning to scan Alpine will teach you most of what you’d need to know about container scanning in general because most individual scanners tend to follow a similar set of patterns.

With those restrictions in place, our much-simplified scanner architecture looks like:

Without much further ado, let’s get to it!

Step 1: setup and image metadata

I used Go for building the components of the scanner so I could use Google’s go-containerregistry module, which makes working with remote container images extremely simple. Reading an image manifest just requires parsing the image tag with name.ParseReference and then downloading the image with remote.Image. The collected code looks like:

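package main

import (
	"flag"
	"fmt"
	"os"

	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
	var image string

	flag.StringVar(&image, "image", "", "Image to scan")
	flag.Parse()

	// Parse the image name (e.g. "alpine:3.15") into a registry reference
	ref, err := name.ParseReference(image)

	if err != nil {
		fmt.Printf("Error parsing image name: %s\n", err.Error())
		os.Exit(1)
	}

	// Get a handle on the image in the remote registry
	docker_image, err := remote.Image(ref)

	if err != nil {
		fmt.Printf("Error fetching image: %s\n", err.Error())
		os.Exit(1)
	}

	manifest, err := docker_image.Manifest()

	if err != nil {
		fmt.Printf("Error fetching manifest: %s\n", err.Error())
		os.Exit(1)
	}

	// manifest and docker_image get handed to the scanner in the next step
	fmt.Printf("Found %d layers\n", len(manifest.Layers))
}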

Having manifest and docker_image gives us all the information we need for the next section. manifest has a Layers field we can use to find the digest² for each layer of the image, which we can then use to get a Layer object with the LayerByDigest method on docker_image.

Step 2: identifying packages

Before we start: I would like to say that I am not an expert at Alpine, vulnerability databases, or Linux package management in general. I can vouch for the educational value of this section with regard to learning how vulnerability scanners work, but I can’t promise there aren’t some errors as I start to get into specifics.

For this step, I had to do some research on actually scanning Alpine for vulnerabilities. I started at the end, with asking what the minimum information you’d need to actually match a package to a vulnerability was. Looking at Alpine SecDB, which is listed as Clair’s vulnerability database for Alpine, I found that the three most relevant pieces of data are: Alpine distro version, package name, and package version. Here’s how I encoded that information:

type AlpineReport struct {
	Packages []AlpinePackage
	Version  string
}

type AlpinePackage struct {
	Name    string
	Version string
}

The term “report” here is borrowed from Clair, where indexers produce reports that are consumed by vulnerability matchers.

The outline of what I called the AlpineScanner function looks like this:

func AlpineScanner(manifest v1.Manifest, image v1.Image) (*AlpineReport, error) {
	layers := []v1.Layer{}
	for _, layer_desc := range manifest.Layers {
		layer, err := image.LayerByDigest(layer_desc.Digest)

		if err != nil {
			return nil, fmt.Errorf("Error fetching layer %s: %w", layer_desc.Digest, err)
		}
		layers = append(layers, layer)
	}

	packages := []AlpinePackage{}
	var version string
	for _, layer := range layers {
		// scan layer
	}

	return &AlpineReport{
		Packages: packages,
		Version:  version,
	}, nil
}

In short: collect the metadata for each layer based on its digest, then iterate over the layers one by one and collect the packages for each layer into a report containing the full list of packages in the image.

That leaves two things to implement while scanning the layers: find the Alpine distro version, and find each package installed.

First, we have to figure out how to scan individual files in each layer. I had to do some research on this, but it turned out to be pretty straightforward. layer.Uncompressed() will return an io.ReadCloser representing a readable stream of the uncompressed tarball for the layer, and we can then pass that into tar.NewReader and get a tar.Reader object that we can use to iterate over individual files:

reader, _ := layer.Uncompressed()

tar_reader := tar.NewReader(reader)

header, err := tar_reader.Next()
for header != nil && err == nil {
  // process the file
  // ...
  // now advance to the next file
  header, err = tar_reader.Next()
}

Calling tar_reader.Next() here will return a header object for each subsequent file in the tarball, which we can use to get the name of the file as well as to read its contents.
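If you want to try the tar-walking loop without touching a registry, here is a standalone sketch that uses only the standard library. buildLayer and readLayer are hypothetical helpers: the first fabricates a gzipped tarball in memory to stand in for a downloaded layer blob, and the second reads it back. Unlike layer.Uncompressed(), this starts from the compressed bytes, so it has to gunzip them itself.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// buildLayer creates a gzip-compressed tarball in memory, standing in for
// a layer blob downloaded from a registry.
func buildLayer(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, contents := range files {
		hdr := &tar.Header{
			Name:     name,
			Typeflag: tar.TypeReg,
			Mode:     0644,
			Size:     int64(len(contents)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(contents)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readLayer returns path -> contents for every regular file in the blob.
func readLayer(blob []byte) (map[string]string, error) {
	gz, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	files := map[string]string{}
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			break // end of the archive
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, and so on
		}
		contents, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		files[hdr.Name] = string(contents)
	}
	return files, nil
}

func main() {
	blob, err := buildLayer(map[string]string{"etc/os-release": "VERSION_ID=3.15.1\n"})
	if err != nil {
		fmt.Println(err)
		return
	}
	files, err := readLayer(blob)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found %d file(s); etc/os-release = %q\n", len(files), files["etc/os-release"])
}
```

Note that file names inside the tarball have no leading slash, which is why the scanner compares against paths like etc/os-release.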

One reason I picked Alpine is that the next steps are really simple. Reading through Clair’s implementation, there are two files we need to look for:

The file at lib/apk/db/installed, which lists every package installed on the system
The file at etc/os-release, which contains the distribution version

When we find /lib/apk/db/installed, we need to do a little additional parsing. If you run docker run alpine cat /lib/apk/db/installed you can see a full example, but here’s a single package for reference:

C:Q1tSkotDvdkl639V+pj5uFYx3AGIQ=
P:libssl1.1
V:1.1.1n-r0
A:aarch64
S:207620
I:536576
T:SSL shared libraries
U:https://www.openssl.org/
L:OpenSSL
o:openssl
m:Timo Teras <timo.teras@iki.fi>
t:1647383879
c:455e966899a9358fc94f5bce633afe8a1942095c
D:so:libc.musl-aarch64.so.1 so:libcrypto.so.1.1
p:so:libssl.so.1.1=1.1
r:libressl
F:lib
R:libssl.so.1.1
a:0:0:755
Z:Q1jtZ7ec5Jx9TJ+IUk5mJnXnn+Gd4=
F:usr
F:usr/lib
R:libssl.so.1.1
a:0:0:777
Z:Q18j35pe3yp6HOgMih1wlGP1/mm2c=

Each one of these package blocks is separated by a double newline, and then each line within the block represents a different field. Because I couldn’t find documentation about the details of the format, my implementation borrows really heavily from Clair’s Alpine package scanner:

if header.Name == "lib/apk/db/installed" {
	contents, _ := io.ReadAll(tar_reader)

	entries := bytes.Split(contents, []byte("\n\n"))

	for _, entry := range entries {
		lines := bytes.Split(entry, []byte("\n"))

		var name, version string

		for _, line := range lines {
			if len(line) == 0 {
				continue
			}
			switch line[0] {
			case 'P':
				name = string(line[2:])
			case 'V':
				version = string(line[2:])
			}
		}

		if name != "" && version != "" {
			packages = append(packages, AlpinePackage{Name: name, Version: version})
		}
	}
}

I’m deliberately dropping a bunch of fields here from the package block to keep the implementation simpler. I suspect that means there are edge cases this code might not handle, but I’m not super concerned about that since my intent here is not to write an airtight vulnerability scanner.
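One dropped field that seems worth a second look is o:, the "origin" package. As far as I can tell, SecDB lists fixes under the source package name (openssl in the example above) rather than the subpackage name (libssl1.1), so a scanner would probably want to keep both. Here is a sketch of the same parsing loop with the origin retained; installedPackage and parseInstalled are names I made up, and the sample input is trimmed down to the three fields being read.

```go
package main

import (
	"bytes"
	"fmt"
)

// installedPackage is a hypothetical variant of AlpinePackage that also
// records the origin (source) package from the "o:" field.
type installedPackage struct {
	Name    string
	Version string
	Origin  string
}

// parseInstalled extracts name, version, and origin from the contents of
// lib/apk/db/installed. Blocks are separated by blank lines, and each line
// has the form "<single-letter key>:<value>".
func parseInstalled(contents []byte) []installedPackage {
	var pkgs []installedPackage
	for _, entry := range bytes.Split(contents, []byte("\n\n")) {
		var p installedPackage
		for _, line := range bytes.Split(entry, []byte("\n")) {
			if len(line) < 2 || line[1] != ':' {
				continue // not a "key:value" line
			}
			value := string(line[2:])
			switch line[0] {
			case 'P':
				p.Name = value
			case 'V':
				p.Version = value
			case 'o':
				p.Origin = value
			}
		}
		if p.Name != "" && p.Version != "" {
			pkgs = append(pkgs, p)
		}
	}
	return pkgs
}

// sample is a trimmed-down excerpt in the installed database format.
const sample = "P:libssl1.1\nV:1.1.1n-r0\no:openssl\n\nP:zlib\nV:1.2.11-r3\no:zlib\n"

func main() {
	for _, p := range parseInstalled([]byte(sample)) {
		fmt.Printf("%s %s (origin: %s)\n", p.Name, p.Version, p.Origin)
	}
}
```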

Next, we need to find the Alpine version used by the container. This is straightforward enough to do just by checking /etc/os-release, a standard across Linux distributions for where the distribution version is stored. As an example:

$ docker run alpine cat /etc/os-release      
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.15.1
PRETTY_NAME="Alpine Linux v3.15"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"

I don’t think that just relying on VERSION_ID here will necessarily cover every edge case, but like with the package versions that should be ok for the purposes of this tool. Here’s the implementation:

if header.Name == "etc/os-release" {
	contents, _ := io.ReadAll(tar_reader)

	lines := bytes.Split(contents, []byte("\n"))

	for _, line := range lines {
		if strings.Contains(string(line), "VERSION_ID") {
			fmt.Sscanf(string(line), "VERSION_ID=%s", &version)
		}
	}
}
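One edge case I do know about: os-release values are allowed to be quoted, and some distributions write VERSION_ID="22.04" with the quotes. Alpine doesn’t, so the Sscanf approach works here, but a slightly more tolerant parser is not much more code. This is a sketch with a hypothetical versionID helper, not what the toy scanner actually uses.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// versionID returns the VERSION_ID value from the contents of an os-release
// file, with any surrounding quotes stripped. It returns "" if the key is
// not present.
func versionID(osRelease string) string {
	scanner := bufio.NewScanner(strings.NewReader(osRelease))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, "VERSION_ID=") {
			continue
		}
		value := strings.TrimPrefix(line, "VERSION_ID=")
		return strings.Trim(value, `"'`)
	}
	return ""
}

func main() {
	alpine := "NAME=\"Alpine Linux\"\nID=alpine\nVERSION_ID=3.15.1\n"
	quoted := "ID=example\nVERSION_ID=\"22.04\"\n"

	fmt.Println(versionID(alpine))
	fmt.Println(versionID(quoted))
}
```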

With that done, we can print out the return from the scanner and get something like:

{Packages:[{Name:musl Version:1.2.2-r7} {Name:busybox Version:1.34.1-r4} {Name:alpine-baselayout Version:3.2.0-r18} {Name:alpine-keys Version:2.4-r1} {Name:ca-certificates-bundle Version:20211220-r0} {Name:libcrypto1.1 Version:1.1.1n-r0} {Name:libssl1.1 Version:1.1.1n-r0} {Name:libretls Version:3.3.4-r2} {Name:ssl_client Version:1.34.1-r4} {Name:zlib Version:1.2.11-r3} {Name:apk-tools Version:2.12.7-r3} {Name:scanelf Version:1.3.3-r0} {Name:musl-utils Version:1.2.2-r7} {Name:libc-utils Version:0.7.2-r3}] Version:v3.15}

Step 3: matching packages and vulnerabilities

Let’s start with some data structures. I relied pretty heavily on the Alpine SecDB data format to figure out how to structure the data in the program. You can see an example here.

Based on that, the return value we’re looking for would look something like:

type Vulnerability struct {
	CVEs        []string
	PackageName string
	Version     string
}

It’s also handy to have a partial data structure to unwrap the SecDB data into. I used three structs for that:

type SecDBReport struct {
	Packages []SecDBPackage
}

type SecDBPackage struct {
	Pkg SecDBPkg
}

type SecDBPkg struct {
	Name     string
	Secfixes map[string][]string
}
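To check that these structs line up with the JSON, here is a small offline sketch that decodes a hand-written document shaped like a SecDB file. The package name and CVE ID in the sample are placeholders, not real data, and parseSecDB is a hypothetical helper. I’ve added explicit json tags for clarity, though encoding/json matches field names case-insensitively so the untagged versions decode the same way.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type SecDBReport struct {
	Packages []SecDBPackage `json:"packages"`
}

type SecDBPackage struct {
	Pkg SecDBPkg `json:"pkg"`
}

type SecDBPkg struct {
	Name string `json:"name"`
	// Secfixes maps a package version to the CVE IDs listed for it.
	Secfixes map[string][]string `json:"secfixes"`
}

// sampleSecDB is hand-written placeholder data in the SecDB layout.
const sampleSecDB = `{
  "distroversion": "v3.15",
  "reponame": "main",
  "packages": [
    {"pkg": {"name": "examplepkg", "secfixes": {"1.2.3-r0": ["CVE-0000-0001"]}}}
  ]
}`

// parseSecDB decodes a SecDB document, ignoring fields we don't model.
func parseSecDB(raw []byte) (SecDBReport, error) {
	var report SecDBReport
	if err := json.Unmarshal(raw, &report); err != nil {
		return SecDBReport{}, fmt.Errorf("parsing SecDB document: %w", err)
	}
	return report, nil
}

func main() {
	report, err := parseSecDB([]byte(sampleSecDB))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, p := range report.Packages {
		for version, cves := range p.Pkg.Secfixes {
			fmt.Printf("%s %s: %v\n", p.Pkg.Name, version, cves)
		}
	}
}
```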

That just leaves some setup to do in the actual method, specifically:

Compute the major.minor version of the distro from the major.minor.patch version we already have
Use that to fetch the vulnerability document from SecDB
Unwrap that value into a SecDBReport object

With some error handling omitted, that looks like:

func AlpineMatcher(report AlpineReport) ([]Vulnerability, error) {
	vulnerable_packages := []Vulnerability{}

	majorMinorVersion := semver.MajorMinor("v" + report.Version)

	client := http.Client{}

	secDBURL, _ := url.Parse(fmt.Sprintf("https://secdb.alpinelinux.org/%s/main.json", majorMinorVersion))

	res, _ := client.Do(&http.Request{
		Method: "GET",
		URL:    secDBURL,
	})

	responseBytes, _ := io.ReadAll(res.Body)

	var contents SecDBReport

	json.Unmarshal(responseBytes, &contents)

	// check vulnerabilities here

	return vulnerable_packages, nil
}

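To actually match packages to vulnerabilities, we can compare every installed package to every SecDB entry and check whether the installed version appears as a key in that entry’s secfixes map. (One caveat: as far as I can tell, those keys are the versions in which each CVE was fixed, so an exact match is only a rough first pass, and a stricter scanner would instead flag any installed version older than a fix version.) The exact-match version looks like: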

for _, installed_package := range report.Packages {
	for _, secdb_package := range contents.Packages {
		if secdb_package.Pkg.Name == installed_package.Name {
			if CVEs, ok := secdb_package.Pkg.Secfixes[installed_package.Version]; ok {
				vulnerable_packages = append(vulnerable_packages, Vulnerability{
					PackageName: installed_package.Name,
					Version:     installed_package.Version,
					CVEs:        CVEs,
				})
			}
		}
	}
}
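If the keys in secfixes are the versions where each CVE was fixed (which is my reading of the data, not something I’ve confirmed in documentation), a stricter matcher would flag any installed version that is older than a fix version. Here is a rough sketch of that idea. Everything in it is hypothetical: the type and function names are mine, the package names and CVE IDs are placeholders, and compareVersions only understands simple versions like 1.1.1n-r0. Real apk versions have suffixes such as _rc1 or _p2 that this ignores.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

type pkg struct{ Name, Version string }

// fixes maps package name -> fix version -> CVE IDs.
type fixes map[string]map[string][]string

// apkVersion is a simplified view of a version like "1.1.1n-r0".
type apkVersion struct {
	nums    []int  // dotted numeric components: 1, 1, 1
	letter  string // optional trailing letter: "n"
	release int    // package release from the "-rN" suffix: 0
}

func parseVersion(v string) apkVersion {
	var out apkVersion
	if i := strings.LastIndex(v, "-r"); i >= 0 {
		if n, err := strconv.Atoi(v[i+2:]); err == nil {
			out.release = n
			v = v[:i]
		}
	}
	if n := len(v); n > 0 && v[n-1] >= 'a' && v[n-1] <= 'z' {
		out.letter = v[n-1:]
		v = v[:n-1]
	}
	for _, part := range strings.Split(v, ".") {
		n, _ := strconv.Atoi(part) // unparseable parts count as 0 in this sketch
		out.nums = append(out.nums, n)
	}
	return out
}

func cmpInt(x, y int) int {
	switch {
	case x < y:
		return -1
	case x > y:
		return 1
	}
	return 0
}

// compareVersions returns -1, 0, or 1 if a is older than, equal to, or newer than b.
func compareVersions(a, b string) int {
	va, vb := parseVersion(a), parseVersion(b)
	for i := 0; i < len(va.nums) || i < len(vb.nums); i++ {
		var x, y int
		if i < len(va.nums) {
			x = va.nums[i]
		}
		if i < len(vb.nums) {
			y = vb.nums[i]
		}
		if c := cmpInt(x, y); c != 0 {
			return c
		}
	}
	if c := strings.Compare(va.letter, vb.letter); c != 0 {
		return c
	}
	return cmpInt(va.release, vb.release)
}

// unfixedCVEs returns, per package, the CVEs whose fix version is newer
// than the installed version.
func unfixedCVEs(installed []pkg, db fixes) map[string][]string {
	out := map[string][]string{}
	for _, p := range installed {
		for fixedIn, cves := range db[p.Name] {
			if compareVersions(p.Version, fixedIn) < 0 {
				out[p.Name] = append(out[p.Name], cves...)
			}
		}
		sort.Strings(out[p.Name]) // map iteration order is random
	}
	return out
}

func main() {
	installed := []pkg{{"examplepkg", "1.1.1l-r0"}, {"otherpkg", "2.0.0-r1"}}
	db := fixes{
		"examplepkg": {"1.1.1k-r0": {"CVE-0000-0001"}, "1.1.1n-r0": {"CVE-0000-0002"}},
		"otherpkg":   {"2.0.0-r1": {"CVE-0000-0003"}},
	}
	fmt.Println(unfixedCVEs(installed, db))
}
```

In this example only examplepkg gets flagged, and only for the CVE fixed in 1.1.1n-r0, since the installed 1.1.1l-r0 already includes the earlier fix and otherpkg is exactly at its fix version.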

Testing it out with go run . -image alpine you get . . . no vulnerabilities found. Which is actually expected – Alpine is a small distribution with good security hygiene, so the latest tag typically won’t have any issues to find.

Can we test it with a known-bad image to make sure we aren’t getting false negatives? I had a surprising (in a good way) amount of difficulty doing this – Alpine doesn’t have any of the vulnerable packages I looked for in its package index, and I’m not sure where to look for an old image that might contain the vulnerable packages. So I’ll, uh, leave further testing as an exercise to the reader.

Conclusion

I had a couple of fairly important takeaways about container vulnerability scanners while working on this that I think are worth sharing, because I suspect they’re related to common misconceptions about container scanning:

Container scanners don’t require any kind of special manifest to identify installed packages. They just read the files in the image; there’s no extra magic happening to make scanning possible.
Container scanners do need packages to be installed in a standardized way. As we saw when building the Alpine scanner, anything not installed through apk would not have been caught by our scanner.

With a normal scanner, you can assume it will also catch anything installed with standard language tooling like pip or npm as well, but you should not assume that something downloaded directly can be scanned.
Container scanners don’t operate on “your” code at all. They won’t catch logic errors or previously-unknown vulnerabilities in your implementation.

Running static analysis of your code would be possible in theory but it isn’t a standard practice for container scanners, and it is not possible to do anything that requires dynamically executing code.
Container scanners are based on reported vulnerabilities so the value of a scanner is going to depend partly on how good a language or distribution is about identifying and publishing vulnerabilities.
Container scanners are for trusted images. Bypassing a container scanner to ensure that an image contains an insecure binary would be trivial, and attackers have various tools to compress layers together to make it even more difficult to detect that a layer contains malicious code.

That means that your scans are useful if you can trust that the images you use are installing packages in a standardized way and otherwise not trying to deceive you about their contents. The scanner protects you from yourself, not from other people.

Taken together, those points emphasize that container scanning has an important but limited role: it helps make sure that trusted container images are using up-to-date dependencies without known security vulnerabilities, but it doesn’t replace other security best practices. I also learned a lot about interacting directly with container images and I think this is a neat example of how much you can do with an image without running a Docker daemon at all.

Whether or not we’re talking about a “Docker” image is an interesting question. Early in the days when containers were becoming a mainstream technology, Docker containers and “containers” were essentially synonymous. But now, Docker is no longer the only way to run containers or even the most popular; containerd is the container runtime that underlies a lot of managed container services, the Kubernetes project is removing Docker entirely, and you can find a number of container building services like buildah or ko that can build OCI-compliant container images without a Docker daemon. I’m generally going to try to use the term “container” instead of “Docker container” in this post, and you can assume everything mentioned here refers to OCI containers and is not Docker exclusive. ↩︎
The digest of a layer is a hashed value that can be used to uniquely identify the layer, and that is generated based on the contents of the layer. sha256 is the most common method I have seen for generating digests. See the OCI descriptor specification for more detail. ↩︎