Inspektor Gadget uses eBPF to inspect Linux systems and can be especially helpful on Kubernetes clusters. One question often asked before deploying Inspektor Gadget in production is how much CPU and memory it consumes. This has been difficult to answer because there was no easy way to check the resource consumption of eBPF programs. Kubernetes has tools to measure the CPU usage of pods (kubectl top, metrics-server, cAdvisor; see the Kubernetes documentation), but those tools don’t cover eBPF programs: eBPF programs run in the Linux kernel rather than in userspace processes, so they cannot be directly associated with Kubernetes resources.

Linux 5.1 introduced a feature to collect statistics on eBPF programs (described in Quentin Monnet’s blog post). These statistics can be fetched with bpftool. However, bpftool is not Kubernetes-aware, so it is still difficult to collect them on a Kubernetes cluster.
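
As a sketch of the underlying mechanism: the statistics are gated behind a sysctl and can be inspected manually with bpftool on any recent kernel (requires root; program IDs and values will of course differ on your system):

```shell
# eBPF statistics collection is off by default because of its (small) overhead;
# this sysctl enables it globally (Linux 5.1+):
sudo sysctl -w kernel.bpf_stats_enabled=1

# With stats enabled, each program listed by bpftool gains
# run_time_ns and run_cnt fields:
sudo bpftool prog show
```

This is exactly the data that the gadget collects and correlates with Kubernetes objects for you.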

That’s where the new top ebpf gadget comes in. It reuses the same mechanism as bpftool but exposes the data directly through kubectl. It shows statistics for eBPF programs from Inspektor Gadget as well as from any other project using eBPF, such as Cilium or Falco.

In this blog post, I will show the results of the top-ebpf gadget for three projects using eBPF programs on Kubernetes:

  1. Inspektor Gadget itself
  2. Falco
  3. Cilium

Inspektor Gadget

Let’s start two gadgets on an AKS cluster to see how they perform: the seccomp advisor gadget and the trace-open gadget.

$ kubectl gadget advise seccomp-profile start --podname normal-pod-mv855
$ kubectl gadget trace open

Then, let’s make a pod run a busy loop:

$ kubectl exec -ti normal-pod-mv855 -- sh -c "while true ; do cat /dev/null ; done"

Now, observe the resource consumption of eBPF programs for 10 seconds:

$ kubectl gadget top ebpf \
        -o custom-columns=progid,type,name,pid,comm,mapmemory,cumulruncount,cumulruntime \
        --sort cumulruntime \
        --timeout 10
PROGID   TYPE             NAME             PID     COMM                      MAPMEMORY CUMULRUNCOUNT CUMULRUNTIME
1203     RawTracepoint    ig_seccomp_e     27805   gadgettracerman              509KiB       1873670   1.2693767s
1205     TracePoint       ig_open_x        27805   gadgettracerman              212KiB         14848   215.5186ms
1204     TracePoint       ig_open_e        27805   gadgettracerman              212KiB         14848    18.4793ms
1206     TracePoint       ig_openat_e      27805   gadgettracerman              212KiB         36647    10.1631ms
1207     TracePoint       ig_openat_x      27805   gadgettracerman              212KiB         36648     3.9473ms
1062     CGroupSKB                         1       systemd                         44B             0           0s
1057     CGroupSKB                         1       systemd                         44B             0           0s
1058     CGroupSKB                         1       systemd                         44B             0           0s
1059     CGroupSKB                         1       systemd                         44B             0           0s
1060     CGroupSKB                         1       systemd                         44B             0           0s
1061     CGroupSKB                         1       systemd                         44B             0           0s

We can distinguish the eBPF programs used by Inspektor Gadget: they are attached to the gadgettracermanager process, and we recently renamed all of them with the ig_ prefix so users can tell the different gadgets apart. The seccomp gadget uses more CPU than the trace-open gadget. That’s understandable: seccomp needs to observe all system calls, whereas trace-open only needs to observe open() and openat().

Let’s stop the busy loop in the pod and observe the resource consumption again:

$ kubectl gadget top ebpf \
        -o custom-columns=progid,type,name,pid,comm,mapmemory,cumulruncount,cumulruntime \
        --sort cumulruntime \
        --timeout 10
PROGID   TYPE             NAME             PID     COMM                      MAPMEMORY CUMULRUNCOUNT CUMULRUNTIME
1203     RawTracepoint    ig_seccomp_e     27805   gadgettracerman              509KiB        340671   207.0367ms
1206     TracePoint       ig_openat_e      27805   gadgettracerman              212KiB         37446     10.252ms
1207     TracePoint       ig_openat_x      27805   gadgettracerman              212KiB         37445      3.995ms
1058     CGroupSKB                         1       systemd                         44B             0           0s
1059     CGroupSKB                         1       systemd                         44B             0           0s
1062     CGroupSKB                         1       systemd                         44B             0           0s
1057     CGroupSKB                         1       systemd                         44B             0           0s
1204     TracePoint       ig_open_e        27805   gadgettracerman              212KiB             0           0s
1205     TracePoint       ig_open_x        27805   gadgettracerman              212KiB             0           0s
1060     CGroupSKB                         1       systemd                         44B             0           0s
1061     CGroupSKB                         1       systemd                         44B             0           0s

As expected, with fewer system calls the eBPF programs are executed less often than in the first experiment. Note that both the seccomp and the trace-open eBPF programs still used some resources even though the pods in the default namespace weren’t running anything. This is because the eBPF programs are attached to the system calls globally: they still need to check whether they are executing in the context of a pod matched by the pod selector given on the command line.

With top-ebpf, Kubernetes administrators can measure whether Inspektor Gadget’s resource consumption is acceptable. The answer might differ depending on the chosen gadgets and whether they run permanently or only occasionally.

Falco

For this experiment, we used an AKS cluster with a single node running Linux 5.4. We contributed a patch to give suitable names to Falco’s eBPF programs. To benefit from it, we built Falco and Falco libs from their master branches at these commits:

  • Falco, commit 574a4b9f0aa0f8ccaa40a3e41d7659316b5f6b38
  • Falco libs, commit 6dec2858c7660f46dbe4bb02d8d801b642eefb08

$ kubectl gadget top ebpf \
        -o custom-columns=progid,type,name,pid,comm,mapmemory,cumulruncount,cumulruntime \
        --sort cumulruntime \
        --timeout 10
PROGID   TYPE             NAME             PID     COMM                      MAPMEMORY CUMULRUNCOUNT CUMULRUNTIME
1326     RawTracepoint    sys_exit         17626   falco                      31.88KiB        319243   171.8727ms
1325     RawTracepoint    sys_enter        17626   falco                      31.88KiB        319046   155.0668ms
1328     RawTracepoint    sched_switch     17626   falco                      23.88KiB         64380    92.4077ms
1329     RawTracepoint    page_fault_user  17626   falco                      23.88KiB         78504     6.6047ms
1327     RawTracepoint    sched_process_e  17626   falco                      23.88KiB           360      647.8µs
1331     RawTracepoint    signal_deliver   17626   falco                      23.88KiB           568        302µs
1330     RawTracepoint    page_fault_kern  17626   falco                      23.88KiB          1449        243µs
1213     RawTracepoint    sys_open_e       17626   falco                      1.597MiB             0           0s
1211     RawTracepoint    sys_single       17626   falco                      1.597MiB             0           0s
1212     RawTracepoint    sys_single_x     17626   falco                      1.597MiB             0           0s
1057     CGroupSKB                         1       systemd                         44B             0           0s
1214     RawTracepoint    sys_open_x       17626   falco                      1.597MiB             0           0s
1215     RawTracepoint    sys_read_x       17626   falco                      1.597MiB             0           0s
1216     RawTracepoint    sys_write_x      17626   falco                      1.597MiB             0           0s
1217     RawTracepoint    sys_poll_e       17626   falco                      1.597MiB             0           0s
1218     RawTracepoint    sys_poll_x       17626   falco                      1.597MiB             0           0s
1219     RawTracepoint    sys_readv_pread  17626   falco                      1.597MiB             0           0s
1220     RawTracepoint    sys_writev_e     17626   falco                      1.597MiB             0           0s
1221     RawTracepoint    sys_writev_pwri  17626   falco                      1.597MiB             0           0s
1222     RawTracepoint    sys_nanosleep_e  17626   falco                      1.597MiB             0           0s

Here we observe something interesting: the eBPF programs named after system calls (e.g. sys_open_e) have no CPU stats, even though system calls were of course executed during the 10 seconds of this experiment. This is because Falco makes heavy use of tail calls. The kernel does not measure the CPU consumption of individual eBPF programs but of chains of eBPF programs: here, sys_enter executes a tail call into sys_open_e, but all of that time is accounted to the first program in the chain, the one attached to the raw tracepoint. Given how heavily Falco relies on tail calls, this somewhat limits the usefulness of top-ebpf, but it will still be useful for measuring performance changes, for example between the current eBPF module and the new modern eBPF module based on CO-RE.

Cilium

For this experiment, we used Cilium v1.13.0-rc0 on Minikube v1.26.1, set up with the following commands:

$ minikube start --network-plugin=cni --cni=false
$ cilium install --version v1.13.0-rc0

I generated some network traffic in a few pods and got the following results:

$ kubectl gadget top ebpf \
        -o custom-columns=progid,type,name,pid,comm,mapmemory,cumulruncount,cumulruntime \
        --sort cumulruntime
PROGID   TYPE             NAME             PID     COMM            MAPMEMORY CUMULRUNCOUNT CUMULRUNTIME
1085     SchedCLS         cil_from_host                              17.6MiB         36622 170.998549ms
50       LSM              restrict_filesy                              24KiB        237237 143.289403ms
1127     SchedCLS         cil_from_contai                            24.3KiB         33348 117.345734ms
1125     Tracing          ig_top_ebpf_it   534935  gadgettracerman       12B        384533  42.150638ms
1099     SchedCLS         cil_from_contai                            24.3KiB           295   5.111228ms
1096     SchedCLS         cil_from_contai                            24.3KiB            31    804.081µs
169      CGroupSKB                                                        0B             0           0s
165      CGroupSKB                                                        0B             0           0s
167      CGroupSKB                                                        0B             0           0s
168      CGroupSKB                                                        0B             0           0s
166      CGroupSKB                                                        0B             0           0s
221      CGroupSKB                         518673  systemd                0B             0           0s
222      CGroupSKB                         518673  systemd                0B             0           0s
224      Tracing          dump_bpf_map                                  102B             0           0s
225      Tracing          dump_bpf_prog                                 102B             0           0s
1065     SchedCLS         cil_from_overla                           24.33KiB             0           0s
1070     SchedCLS         __send_drop_not                                32B             0           0s
1071     SchedCLS         tail_handle_ipv                           22.35MiB             0           0s
1072     SchedCLS         cil_to_overlay                                  0B             0           0s
1080     SchedCLS         cil_to_host                                  24KiB             0           0s

Since v1.13.0-rc0, Cilium gives its eBPF programs names with the cil_ prefix (#19159).

Using top-ebpf without Kubernetes: local-gadget

Some of the gadgets in Inspektor Gadget can be used without Kubernetes through local-gadget, and top-ebpf is one of them. Even though we haven’t implemented a pretty UI for it yet, you can give it a try with the following commands:

$ sudo ./local-gadget interactive --runtimes docker
» create ebpftop trace1
» stream trace1 --follow
{"stats":[{"progid":50,"pids":[{"pid":1,"comm":"systemd"}],"name":"restrict_filesy","type":"LSM","totalRuntime":55801042,"totalRunCount":85641},...
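
Since the stream emits cumulative counters as JSON, it is easy to post-process with standard tools. As a sketch, assuming jq is installed, this derives the average runtime per invocation from one stats record (the echoed line mimics a shortened record from the stream above):

```shell
# Compute average runtime per run for each program that ran at least once.
echo '{"stats":[{"progid":50,"name":"restrict_filesy","type":"LSM","totalRuntime":55801042,"totalRunCount":85641}]}' |
  jq -r '.stats[]
         | select(.totalRunCount > 0)
         | "\(.name): \(.totalRuntime / .totalRunCount | floor) ns/run"'
# restrict_filesy: 651 ns/run
```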

What’s next

We hope this tool will be useful both for Kubernetes administrators to evaluate the resource consumption of eBPF programs and for developers of eBPF projects to improve performance.

Here are some features we’re working on:

  • Improve top-ebpf UI in local-gadget.
  • Annotate SchedCLS programs with the name of the network interface and the Kubernetes pod they belong to.
  • Annotate CGroupSockAddr programs with the cgroup and the Kubernetes pod they belong to.
  • Display eBPF program metadata.

I’d like to thank the Falco team for helping me to run the experiments and the Cilium team for helping me to understand how the eBPF statistics work and for the cilium/ebpf library.
