*** Looking up your article... *** Found your article... :~$ head alex.ascii ,,,... ...,........ .*//(((((((/(((((((((####((/*,. .,*/(//(//(((#%%###(##%%%##(#%%%%%#####(*.. .***/((//((##/###(#%&&&&%%%&%%%&&%#%%%%&&&&%%%%#/,. .,**/(/((##%#(((###%%%%%%&@@&&&&&&&&&&&&&@@&&&&&&&&&%(. ..*///#((#(((#%%%##%%%%&&&&&&@@@&&&&&&%%%&%&&&@@@@@@@&&%#*. ,*(/(((((((((###%%%%%%%&%%%&&%%%%%#%###((/////((##%&@@@@&&&%%(*. .*/(/(##(########%#%%%%%%###%%####((((///**********//(((#&&@@@&&&%(, .*/((##############(((////(/*******,,,,,*,,*,,,,*****/////((#%@@&&&%#(*. ,/(/(((####%#(((///**,,,,,,,,,,,,,,,,,,,,,*,,,,,*********//////((&@&&%%&%/. ,/((#((((##(((/**,,,,.....,,,,,,,,,,,,,,,,,,,,,,*************////(%&&%%%%%(. ,*/((##/(##(*...,........,,,,,,,,,,,,,,,,,,,,**,,,*************///(%&&%%%%%/. .*(##(((((#%(,.. .. .......,,,,,,,,,,,,,,,,,,,,,,,,,***********////(#%&&%#%%%(, ,/(((//(##*,. ....,,,,,,,.,..,,,...,,,,,,**************/////(#%&%%%%%%%#*. ,(##(###%&%#*. .,*/(#%###((/*,,..,,,..,,*/((##%%%###%%####((////((#%%&%%##%%%#, ,(##(###&%&%#, ,////**,,,,*/(/////*,,,,,,,**/((##%%&%&&&&%%&&%&%#(///(%&&%%&%%%%%#*. ,(%%%#####(%(..*,,**/////*/////*/,**,.,,***/(* .*/(((#(((///*/((#%%#(((%@@%&&&&&%%#((/, ,(###((/. ,,,,///(/*....,///*//(/,.,,,/((.*((/(*,/(##/,,,*((**//((((((%@&&&&@@&%(,. ./((((((%,,,**** /*/##(/,/*.,,***/(.. ...,(*//(#(,.,,**(((**/(((%&@&@@@@&@%*. ./((##%&.,,*, . .//./(//,..**.,**/(*//*,*/(/(#((/,,,***//*,*((###%@@@@@@&&%* ,(##%&(.,,..,,,. . .,.. /.,,**//#(./*,,*,,,.,.,*********,////((&@@@@@@@&. ,(##%. ,. . . *. .,**/((*.****,,,,,........,,,,**////#&@@@@@&&&(, .*/* ............... .. ....,**//((/,*/*****,,,,,,,,,**////////&@&&&&&%, ,. ........,,,.............,**//(//*/**,***,,,,,,,*//////////%&%#/*. .. ..........,,,,,,*///**///****,***////***//*/*/////(%(*. .. ........,.,*((#/**//##////*,,,,,,,*************///(#(*. .. .......,,,,..,*/*,,,****/(((((/*,,,,,,,,,********/////(((/*, ... . .............,*(((/(//////(####//***,,,,*,,,,*****/////((#/,. .,............,*///(##(///(///((((##%%#((///*,,,********////((##*. ,,,........*/(((#(/**,,,*,,,,,,*//#%%%%%%%##(/******////((((###*. ,.,,,,,.,.*(((((/**//(((/,,,/**/(/(((((#%%%%%%#(//*//((((((###. .,,,,,.,,,(/****/#//***,,,,,*//*/(#%%##((((#%#(//(((((((((#( .,,****,,,,,,,,,...,,..,,,,*******////((///**//(#(((((#((((##* ,**,***,,*,,,.........,,,,********///******///*/(((#((((####. ,****,,,,,***,,.....,*///(((((((((//****//(((/(((###(###%%/. .****,,,*/((/*,,....,*//(#((//****/*///(######(((#(##%%&%(. ,*//**/((#((/*****//*//////////(#(##%%%%&%####%#&%&&%/, *##########((**//////(/((/(/#(##%%&&%&&&&%&&&&&&, ,#%%&%%%###(((((((//((((((###%%%&&&&&@@&@@@&&&%(, ,/#%&%%%#(##(#(#((########%#%%&&@&@@@@&@&&%#*. .*(%&&%%###(####%#%%###&%&&&@@@@@&&&&%%%#/, .,*(%&&%%%%&%&&&&&&&&&@@@@@&&&&%&%(*. .,*(##%%&%%%&%&&%%%%#((///*,. . ~~[every Boring Problem Found in eBPF] by @FridayOrtiz~~ > /WHOIS @FridayOrtiz *** @FridayOrtiz https://ortiz.sh/contact/ > /LIST *** eBPF, BPF, Linux Kernel, guide, tips +-#- Introduction -#- | | About six months ago I started a new job and dove into adding Berkeley Packet | Filters (BPF) as a telemetry source for our Linux endpoint security agent. This | work culminated in the release of three open source libraries ([0], [1], [2]). | This isn't about that, though. This is about the issues we ran into while | implementing BPF as defenders, and how defenders can use BPF in their | environments (although attackers should find useful tips in here too). I went | through every PR, Jira ticket, and message from the past six months to put | together this list of BPF gotchas and their solutions. I hope it helps | defenders, developers, and researchers ramp up with BPF faster than I did. | | Note: I'm going to use BPF to mean extended BPF (eBPF), since that's the | official name. Not to be confused with the old classic BPF (cBPF). I'm also | going to assume you're *loosely* familiar with BPF, enough to be considering | whether or not to deploy it in your environment. +- +-#- Why even use BPF? -#- | | You may be wondering, "if you're filling an article with caveats about BPF, why | should I even bother trying to use it?" Great question, straw man. There are a | number of things BPF is really, really, good at that you should consider. | | - **You can get visibility (almost) anywhere you want.** If there's a specific | code path within the kernel (or userspace) that you know will be executed during | an attack, you can put yourself there. If there's a payload value in a packet | you need to see before it hits your `iptables` rules, you can do that. Want to | modify or block syscall args? You can do that too. | | - **You can reinstrument dynamically.** Change your mind about what you want to | inspect? Change it while it's running. You can either swap out the entire | program (although that might not be possible in the future with signed BPF | programs[4]) or modify behavior by updating BPF maps from userspace. | | - **It's safe!** You can do all these things with a Linux Kernel Module (LKM), | sure, but the BPF Virtual Machine (BPF-VM) and verifier ensure (or at least | try really hard to ensure[17]) that you can't panic or break the kernel. | | - **It's container aware, or at least it can be**. Instrumentation alternatives | like `auditd` tend to struggle in containerized environments, returning values | that only make sense in certain namespaces, or losing track of things entirely. | BPF, on the other hand, can give you information in whatever context you want | (as long as you program it that way). | | - **It's fast.** Do as much work as possible in the BPF program before sending | information up to userspace, and you can avoid expensive context switching or | race-prone data enrichment. | | - **It's atomic (sort of).** BPF programs generally aren't preemptable (there | are exceptions[5]). This applies to tail calls as well, so you can set up some | fairly complex logic in your instrumentation without worrying too much about | reentrancy. +- +-#- The problem with BPF. -#- | | From this writer's perspective, there are two main problems with BPF: 1) it's | now being used in ways it was never designed for (i.e., it's evolving naturally | over time) and 2) there's a large overlap in maintainers of the Linux kernel's | BPF subsystem and userspace BPF tooling. | | The following are concrete issues stemming from (1): | | - BPF isn't really CO-RE (Compile Once - Run Everywhere), it's more CE-RO | (Compile Everywhere - Run Once). A lot of userspace tooling "achieves" CO-RE in | practice by compiling on the host machine. New true CO-RE features are... new. | Chances are you need to support a kernel that doesn't have them. End-of-life | doesn't mean much when the host is running a business-critical function and the | suits see too much risk in upgrading it. And loading a full toolchain to | compile BPF programs on the host is often a no-go too. | | - The toolchains and libraries are designed around `bpftrace`-like use cases. | That is, one-off tooling for diagnosing specific problems. Brendan Gregg's | book[7] is a great resource for this. Now that BPF for long-running daemons is | gaining popularity, the maintainers are working hard adding features to support | this (like the aforementioned true CO-RE). Unfortunately, again, these features | probably won't exist on the kernels you need to support. | | - There are many different types of BPF programs (which we'll cover), that all | have varying load and run semantics. Depending on whether you want to run a | kprobe or a TC classifier, you'll have to use entirely different methods to do | so. And while you're writing them, the helpers available to you can vary | wildly. And the documentation is incomplete, scattered, and often out of date, | because... | | | And here are some specific examples of issues stemming from (2): | | - Because of the overlap, documentation of the pure BPF interface(s) (there's a | plethora, we'll cover that) is lacking. The people that maintain it write the | userspace tooling, so they don't need in-depth documentation. Seriously, go | check out the BPF manpage for whatever distro you're on. Chances are it's | missing a ton of helpers and there's more than one "TODO: fill this out" that's | been sitting there for years. Why not use their userspace tooling? Well... | | - Their userspace tooling is a magic labyrinth. In order to get close to CO-RE in | a backwards compatible way, it's filled with kludges you probably don't need. | Ideally, you'd interface directly with the underlying syscalls and use only | what you need. But doing that is undocumented. And, because of the | documentation issues, there's really no community drive to simplify these | libraries. Because these libraries cover (until recently, see (1)) the majority | of historical use cases, there's no drive to improve the documentation. Even if | you did, you'd have to backport and patch your documentation to cover all the | little idiosyncrasies across kernel versions, and boy are there a lot of those. +- +-#- Implementation Issues -#- | | While working with BPF we ran into a number of implementation specific problems | that lead to us building and publishing those three ([0], [1], [2]) BPF tools. | If you're a defender, or work in security, and you're considering getting | started with BPF here's a list of things you'll probably want to know. | Presented in somewhat logical order. | | +-**- The verifier sucks, but the alternatives are worse. -**- | | | | -...- Problem -...- | | | | You will run into lots of problems with the verifier. For example, what's the | | difference in the following two code snippets? | | | | ```c | | u32 *p = 1; | | u32 i = *p; | | ``` | | | | ```c | | u32 *p = 1; | | u32 i = NULL; | | __builtin_memcpy(&i, p, sizeof(u32)); | | ``` | | | | I'll tell you: the first one fails the verifier, the second one does not. But | | only sometimes, except when it works. Which depends on the kernel version. | | Maybe. I mean, they should compile to the same thing, right? Apparently not, | | and subtle differences can completely throw the verifier off. | | | | The real problem with the verifier is that it's getting better all the time. As | | BPF use cases settle out, the maintainers are changing the BPF verifier to | | better support them. That means on older kernel versions, without these | | patches, you'll have to perform strange workarounds to get your code working | | with older verifiers. | | | | A few more verifier problems you'll likely encounter supporting a wide range of | | kernel versions: | | | | - The verifier hates looping. But, sometimes, it also hates loop unrolling. If | | `clang` generates enough jumps and gotos, even if you tell it to unroll | | everything, the verifier might (depending on version) fail it anyway. The | | verifier needs to be able to keep track of all branches and ensure a maximum | | depth limit. If it can't (whether it's because you're looping or because the | | verifier can't keep up) your program will fail to verify. | | | | - In older kernel versions (but not newer ones) variable reads and writes are a | | big no-no. All offsets must be known at compile time. That means you can't do | | things like set `some_array[variable_index] = some_value`. This, plus the | | aversion to loops, makes it nearly impossible to read strings from memory on | | kernels without the `read_str` family of helpers. The kernel's own `qstr` | | involves variable memory access—and good luck finding (or setting) the null | | terminator on your own. | | | | - Everything that might be a pointer must be null checked. If you don't, the | | verifier will refuse to load your program even if it's safe. This makes it hard | | to work with programs that might expect a null value. The convention that most | | pleases the verifier is to return immediately after a failed null check, and | | getting around this is tricky and involves trial and error. | | | | There are alternatives to the kernel verifier, such as PREVAIL[8], but they | | have their own set of issues. For what it's worth, PREVAIL is an impressive | | project and Microsoft will be basing their Windows BPF verifier off of it. But, | | unfortunately, it doesn't match the expected behavior of kernel verifier. Just | | because something passes PREVAIL doesn't mean it will pass the kernel verifier. | | Just because something fails PREVAIL doesn't mean it will fail the kernel | | verifier (even though it probably should). | | | | -...- Solution -...- | | | | **Run early, run often, run everywhere.** Your development environment should | | make it as easy as possible to test your code on all the kernels you need to | | support (or as close to a representative sample as you can get). The only way | | to know if the kernel verifier will accept your program is to run it through | | the kernel verifier, the real kernel verifier, on the specific kernel you're | | targeting. Note that this means the distro-specific kernel, with all their | | modifications and backports. For example, the older Enterprise Linux (red hat, | | centos, and so on) kernels (2.x and 3.x) have backported BPF features that | | might surprise you, since they don't line up with mainline kernel version | | numbers. The only way to know what's supported is to try it. | | | | **Enable logging.** This one comes with a caveat. You need to provide the | | verifier with a large buffer of memory to write its verification logs into. If | | you don't give it enough space it will fail verification, even if the program | | would otherwise pass. If your programs are complex, then make sure your buffer | | is large enough (but not too large, or loading will take forever) and be sure | | to turn off verifier logs in production to avoid issues with programs failing | | to load when you know they should. | | | | **The error messages you get will seem cryptic at first.** The BPF verifier | | uses a lot of terminology and has a lot of restrictions that are undocumented | | (of course) that you'll learn with time. If you get stuck, the BPF Compiler | | Collection (BCC) GitHub repo's issue tracker[9] is a great resource. You can | | probably find a Brendan Gregg ticket that goes over at least the broad class of | | error you're getting. | +- | | +-**- BPF doesn't really exist. -**- | | | | -...- Problem -...- | | | | BPF is really just an instruction set, for which the Linux kernel provides a | | VM, verifier, and some helper functions. You run your programs inside this | | execution context, and call the helper functions to extend the VM's | | capabilities. When you write a BPF program, what you're really writing is a | | kprobe, or a uprobe, or an eXpress Data Path (XDP) classifier, or a Traffic | | Control (TC) classifier, or one of the many other types of kernelspace programs | | that have been offloaded to the BPF subsystem. There's a ton of BPF program | | types and more are being added all the time, for a variety of use cases. It | | turns out being able to safely execute code in the kernel enables a ton of | | interesting and helpful functionality. Unfortunately, every program type has | | its own way to load, run, and clean up after it, most of which is entirely | | undocumented. | | | | On certain distros, the tools you'll need to load these programs might not be | | enabled by default. For example, some distros don't automatically mount | | `debugfs`, which you'll need to load kprobes on older kernels. | | | | When you do figure out how to load your program, the ABI for defining programs | | is entirely based on undocumented, implicit, convention. For example, you'll | | see a lot of `SEC("kprobe/my_kprobe")` to tell the loader that you're loading a | | kprobe. | | | | ```c | | /* helper macro to place programs, maps, license in | | * different sections in elf_bpf file. Section names | | * are interpreted by elf_bpf loader | | */ | | #define SEC(NAME) __attribute__((section(NAME), used)) | | ``` | | | | This is actually entirely unnecessary, on the syscall level, and is merely a | | common convention. As you can see in the above snippet, it's just a macro to | | set the section name in the compiled ELF executable. There's nothing | | BPF-specific about it. So you not only have to know the requirements of the | | program type you're trying to load, but also the conventions used by the tools | | that load and run it. | | | | -...- Solution -...- | | | | Figure out what you want your program to do first. Do you want visibility into | | the kernel? Then you'll probably want a kprobe or tracepoint. Do you want to | | drop inbound packets? You probably want XDP. Do you want to build detections on | | outbound traffic? You might want TC, or you might want a kprobe in the kernel | | network stack. Figure this out, then figure out what you're going to need to | | run it the way you want to run it (one off? daemon?). When we made `oxidebpf` | | we had to optimize for stability in the features we needed most (e.g., kprobes) | | over coverage of all the different BPF program types. | | | | If you can't find libraries to suit your needs for your chosen program type, | | you'll probably have to write it yourself (or contribute it to an open source | | project). Because everything is poorly documented, you'll have to dig through a | | lot of source code to put together the real set of necessary functionality. The | | official-ish libraries like libbpf and libbcc tend to work the best, but | | there's issues there (that we'll get to). | | | | I highly recommend using `bpftool` for debugging while developing. It provides | | the easiest-to-use view into what programs and maps are loaded and where. It | | lets you visualize data in maps, dump programs, and more. The only problem with | | `bpftool` is that it's never in the same package. Some distros and repos let | | you install it with a `yum install bpftool`. Others require you `apt-get | | install linux-oem-tools`. Sometimes you need to `apt-get install | | linux-oem-tools-`uname -r``. It depends. Whatever you're running, though, | | you'll probably want this tool installed. | +- | | +-**- I hope you find constraints fun. -**- | | | | -...- Problem -...- | | | | Are you one of the dozen or so people unreasonably upset that `0x10c`[10] was | | never released? Me too! I find working within constraints challenging and | | enjoyable. And let me tell you, BPF programs have a lot of constraints. | | | | You get 512 bytes of stack space for your program, half a kilobyte. This | | doesn't appear to be something that has or will ever change. It's also unclear | | if this applies to tail calls. Some documentation implies that tail calls use | | the same stack space, so you're limited to 512 bytes total, but in practice it | | seems to be 512 bytes per program. And `clang` probably won't be able to help | | you. BPF programs, for whatever reason, don't like to reclaim stack space. Your | | variables will get hoisted and instantiated at the start of execution. If you | | want to do things like dump syscall arguments or `pt_regs`, especially when | | working with strings, you'll find yourself running out of stack space very | | quickly. | | | | There's a practical instruction limit of about 4096 instructions. The | | instruction limit in the past was set (as far as I can tell) based on what the | | verifier could verify before declaring "this has gone on too long, I can't | | verify this won't halt, so I'm failing it." You can get more instructions by | | manipulating the verifier, and doing other tricks you'll find in the mailing | | lists, if you really want to put in the effort. Newer verifiers let you get | | upwards of 1,000,000 instructions, but that'll only help you if you're | | supporting newer kernels. | | | | -...- Solution -...- | | | | To work around the instruction limit, you can use tail calls. Tail calls are | | the closest BPF has to a true function call. You transfer flow over to | | whichever program you call into. You can chain tail calls like this together up | | to 33 times. There are some caveats, which we'll get to later. | | | | There are a few tricks you can use to work around the stack limit. One trick is | | to explicitly reuse stack space. For example, reusing variables or | | instantiating a struct of bytes to act as your scratch space and manually | | reusing offsets within it. Another trick is to build your own stack with maps. | | On some kernel versions you can request a struct from a map and get a pointer | | to it. If the requested struct doesn't exist (e.g., the array map at the | | requested index was empty) you'll still get back a pointer to an empty struct | | that you can manipulate. Other kernel versions require this map-struct to be | | copied to the stack before being modified, so your mileage may vary. | | | | With all that said, I want to offer some practical advice. If the information | | you're retrieving is too big to ever fit on the stack, you should just send it | | out as you read it. Create a messaging type and pipeline for chunking and | | rebuilding data in userspace, copy as much as you can to the stack, and then | | send it up through a map. This will run on a wider range of kernel versions, | | and you won't have to worry about if your host kernel allows directly | | manipulating and emitting map memory. You can reconstruct it in userspace at | | will. This is what companies like Google are doing for their BPF telemetry. | +- | | +-**- The good stuff is GPL. -**- | | | | -...- Problem -...- | | | | All the useful helper functions (like `perf_event_output`[6]) are exported as | | GPL-only. If you want your program to do anything useful, you're going to have | | to license it under GPL. That makes it hard to make proprietary programs based | | on BPF. If your program is only internal, and never distributed, you're fine. | | But if you start distributing your programs (to customers, friends, wherever) | | you need to publish it under GPL. | | | | -...- Solution -...- | | | | Short answer: Make the world a better place, release your BPF code and tools. | | | | Long answer: BPF is still a niche and complex discipline, so open sourcing your | | tooling doesn't reduce competitive effectiveness for a business. From an | | individual perspective, open sourcing your tooling gets your name out there and | | makes you more valuable as an employee. From an employer perspective, the more | | accessible BPF becomes the easier it will be to hire people to build and | | maintain it. From the community perspective, we can all learn from each other | | by working in the open. Perhaps you have an interesting use case that the | | maintainers of other libraries would want to know about, or could offer advice | | on. Everybody wins. | +- | | +-**- By default, you get the default. -**- | | | | Alternatively, BPF is only good with containers if you tell it to be. | | | | -...- Problem -...- | | | | Be careful with the assumptions you make about the information you retrieve | | from a BPF program. If you grab the retcode of a `fork` call, it's going to | | give you the retcode of the `fork` call: the pid in the namespace of the | | calling process. Maybe this is what you wanted, or maybe you really wanted the | | pid of the child process un-namespaced. Maybe you ask the BPF program to gather | | the pid (with the `get_pid_tgid` helper). You take the upper 32 bits, | | corresponding to the pid, but nothing lines up. Well, you're executing in | | kernelspace which means the `pid` you probably want is actually the `tgid`, and | | what you got was a `tid`. Unless you wanted a `tid`, in which case you should | | get the `pid`. The kernelspace understanding of a `pid` is not the same as the | | userspace understanding of a `pid`. If you want to identify a file, you | | probably want the inode number and device number, a file descriptor won't be as | | useful. | | | | -...- Solution -...- | | | | If you want your program to retrieve information, think about what information | | you need to retrieve. Make sure you know where that information exists (what | | structs, where they live in memory, and how to get there) and then find a place | | (assuming you're launching a kprobe) in the kernel you can attach your program | | as close to that information as possible. For example, if you really need the | | root namespace pid of the child process of fork, you probably want to hook | | somewhere in the path of the new child process so you can grab the `pid` from | | the `task_struct`. | | | | Be aware that this location might change between kernel versions, or the | | information may take a different form. You may have to choose a less optimal | | probe point that is available on more systems. Or you may have to change the | | information you're gathering to something else that exists on all the kernels | | you support. That leads us to the next two issues. | +- | | +-**- CO-RE (probably) won't help you. -**- | | | | -...- Problem -...- | | | | The maintainers are constantly adding feature to help BPF developers compile | | once-run everywhere their BPF programs. Unfortunately, you'll likely find | | yourself trying to target kernel versions that don't have these features. Or, | | if you do, since these features are added in piecemeal, it may not have all the | | CO-RE features you expect. | | | | For example, the BTF feature makes it possible to reference struct members | | directly, even if they've been compiled in a randomized layout, across | | different kernel versions, and without recompiling. This feature was added in | | April of 2018[11]. You will probably need to write code for kernels from before | | April 2018. This means something like `current->real_parent->pid` is not | | guaranteed to work without recompiling for (or on) the host. | | | | -...- Solution -...- | | | | There's really no way around this one. It's what we're doing, Microsoft is | | doing it too for their Linux machines, and I'm sure there are others. First, | | you determine the offsets of struct members for your desired kernel version and | | then you load them dynamically into your BPF program at runtime. For example, | | this code snippet from [12] shows how we read struct offsets from a map and use | | that in our `bpf_probe_read` to retrieve values. | | | | ```c | | static __always_inline int read_value( | | void *base, u64 offset, void *dest, size_t dest_size | | ) | | { | | /* null check the base pointer first */ | | if (!base) | | return -1; | | | | u64 _offset = (u64)bpf_map_lookup_elem(&offsets, &offset); | | if (_offset) | | { | | return bpf_probe_read(dest, dest_size, base + *(u32 *)_offset); | | } | | return -1; | | } | | ``` | | | | To actually find these offset values in the first place, we built the | | `linux-kernel-component-cloud-builder`, or `LKCCB`, which builds hundreds if | | not thousands of kernel modules with debug enabled for all our target kernel | | versions and extracts structure offset information from the LKM's `DWARF` debug | | info[1]. | +- | | +-**- Stable interfaces aren't. -**- | | | | -...- Problem -...- | | | | You'll often find, when working with the kernel, that there aren't as many | | stable interfaces as you thought there'd be. Even syscalls, which are supposed | | to be a big part of the stable user interface, aren't necessarily stable. | | | | For example, if you somehow traveled back in time and wanted to monitor process | | forks, you'd probe the `fork` syscall. That'd work fine for a bit, until | | `clone` is introduced. If you stopped paying attention, you'd lose your data | | altogether when glibc (and everything with it) switched `fork()` to be a | | wrapper around `clone()`. | | | | Maybe you want to get the arguments of a syscall. Should be easy, you're given | | `pt_regs`, just access the registers that hold the arguments! Except if you're | | on x86_64, and on a kernel older than 4.17, you'll probably be given the | | `pt_regs` of the syscall wrapper function, that in turn calls the real syscall | | function. And of course, it all shuffles around if you need to add aarch64 | | support, which has its own set of calling conventions. | | | | Sometimes a symbol that's marked as being exported can't be attached to, almost | | inexplicably. Usually this is due to GCC inlining the function, and the symbol | | being renamed to something like `symbol_name.part.1213`. Trying to bind | | `symbol_name` won't work. | | | | -...- Solution -...- | | | | For different architectures you can probably get away with macros that | | conditionally compile depending on what architecture you're targeting, and then | | building one copy per architecture. For the syscall wrappers, you can do | | something similar but build targeting different kernel versions. In practice, | | you may find you need many variants and copies of a single program, all with | | slight differences, to support different kernels and architectures. | | | | For the symbol name problem, it comes back to run early and run often. It's | | often worth spinning up a VM of a few of the kernels you're targeting and | | double checking that the locations you're hooking are indeed in | | `/proc/kallsyms`. Sometimes you'll find the functions you were looking at don't | | exist in different versions, or have been renamed and relocated. I recommend | | getting comfortable with Bootlin's Elixir cross referencer (but you still need | | to run and see, because distros do their own backports which won't match what's | | in the mainline cross referencer). | +- | | +-**- Running BPF programs as intended involves magic. -**- | | | | -...- Problem -...- | | | | If you use libbcc, libbpf, bpftrace, or any other other high level BPF tools | | you'll quickly notice that they do a lot of magic for you. BCC (the python | | interface) will more or less rewrite your programs for you so they work on your | | host system. You'll end up getting error messages on code that the library | | wrote for you. They also help you get around CO-RE limitations by compiling on | | the host, and using different tricks and kludges to get the same program code | | to operate in many environments as cleanly as possible. This doesn't help a ton | | when you need to build and distribute actual raw BPF programs in their own ELF | | file. | | | | These libraries are also pretty convoluted. There's a lot of overlap in | | maintainers between these libraries and the people working on BPF in the | | kernel, so documenting the interactions isn't a priority. But you don't | | actually need all the stuff these libraries are doing. After a while, | | especially with the rewriting, you'll find yourself wanting to write and load | | pure BPF C code. Here's a snippet of a map I put together when trying to figure | | out what syscalls were actually being made when libbpf loaded a program. | | | | ``` | | KProbe | | |-> bpf_attach_kprobe() | | |-> bpf_attach_probe() | | |-> bpf_try_perf_event_open_with_probe() | | |-> bpf_find_probe_type() | | |-> bpf_get_retprobe_bit() | | |-> syscall(__NR_perf_event_open) | | |-> create_probe_event() | | |-> enter_mount_ns() | | |-> setns() | | |-> exit_mount_ns() | | |-> setns() | | |-> bpf_attach_tracing_event() | | |-> ioctl( PERF_EVENT_IOC_SET_BPF ) | | |-> ioctl( PERF_EVENT_IOC_ENABLE ) | | |-> bpf_close_perf_event_fd() | | |-> ioctl( PERF_EVENT_IOC_DISABLE ) | | ``` | | | | These libraries are also GPL, which means your userspace program would end up | | being licensed under GPL and not just your BPF programs. As great as this is | | for users, if you work for a company that likes to make money you might not be | | allowed to touch GPL. It might even be in your contract. | | | | -...- Solution -...- | | | | If you're writing complex BPF programs for security, you'll probably want to | | write it in C without the "help" of something like BCC. You'll also want a bit | | more control and transparency when loading and attaching your programs. In my | | experience, libbpf wasn't great at cleaning up after itself and it got | | frustrating. | | | | Use a clean, simple, library built from the ground up for loading BPF in your | | language of choice. For Rust, `aya`[16] is a good one, and I worked on | | `oxidebpf`[0]. Golang also has some good options. One common theme of these | | projects is the amount of effort that went into reverse engineering the | | undocumented program loading logic and reimplementing it. Take advantage of | | that work and use it to load your programs. They're also permissively licensed! | +- | | +-**- Speed kills. -**- | | | | -...- Problem -...- | | | | After getting into BPF, you may benchmark a few of your programs and be | | surprised at how much faster BPF is than what you've been using before, like | | audit. This makes it very tempting to trace and probe more than you probably | | should. For example, if you want to trace socket closes you may be tempted to | | put a kprobe on the `close` syscall. This syscall is called all the time, and | | probing it will slow your system down unnecessarily. Most of the messages will | | be discarded since you only care about sockets. There are plenty of other | | interesting areas that can't be reasonably instrumented due to the performance | | impact. | | | | -...- Solution -...- | | | | Trace only what you need, and scope it down as much as possible. Going back to | | the `close` example, you'd be better off probing somewhere downstream where the | | individual `tcp_close` or `udp_close` functions are called. | | | | ``` | | struct proto tcp_prot = { | | // ... | | .close = tcp_close, | | // ... | | }; | | EXPORT_SYMBOL(tcp_prot); | | | | struct proto udp_prot = { | | // ... | | .close = udp_lib_close, | | // ... | | }; | | EXPORT_SYMBOL(udp_prot); | | ``` | | | | Brendan Gregg's book[7], again, has a great table that shows the overall | | performance impact of tracing different points in the kernel. You could also | | just reason intuitively about how often you think each area you want to probe | | is exercised. The more commonly a code path is executed, the more expensive it | | will be to probe it. | | | | Even after doing your best to scope down and optimize your BPF programs, you'll | | probably want to run benchmarks as you tweak things to see what performs best | | in your target environment. Flamegraphs[13] are a great way to see where most | | of your overhead is coming from, especially if combined with a benchmarker like | | UnixBench[14]. The results may surprise you. | | | | I'd also recommend processing BPF events in batches. You'll probably be sending | | out a lot of information through maps that needs to be read from userspace. If | | you're getting information that's too big for the stack, the information will | | be sent in chunks that need to be reconstructed in userspace. It's definitely | | possible to loop a blocking poll+read on the perfmap or BPF ring buffer, but | | doing so will result in significant overhead. You're much better off letting | | the buffers fill a bit, and processing them in batches (batch process, don't | | stream process). Doing that netted me significant performance gains in | | benchmarks for the BPF programs I write at work. | +- | | +-**- Don't Panic. -**- | | | | -...- Problem -...- | | | | BPF programs will generally live as long as something holds a file descriptor | | that points to them. However, sometimes you need to manually clean up after | | them (such as when using `debugfs`). If your userspace program crashes or | | panics things may not get cleaned up properly. This can lead to all sorts of | | problems when you restart your probes, such as receiving duplicate events. | | | | If you're building short lived, one off, tools this is less of a concern. But | | if you're managing several probes as part of a long-lived monitoring daemon | | then this is something you need to be careful with. | | | | -...- Solution -...- | | | | Make sure you design the userspace component of the BPF program to keep your | | programs alive for as long as you'll need them. Gracefully handle all errors in | | the thread that keeps your BPF programs alive and make sure you clean up after | | yourself in the event of failure or shutdown. Keep in mind that many older BPF | | tools are built around short-lived programs, meant for things like `bpftrace` | | or production debugging. | +- | | +-**- Know your limits. -**- | | | | -...- Problem -...- | | | | Your program will have instructions and will probably use maps. These take up | | space, which the BPF syscall will handily memlock for you. On many distros, | | this is fine. On others, however, the default memlock ulimit is quite low[15]. | | See the following output of `ulimit -l` on various distributions. | | | | ``` | | vagrant@ubuntu2004:~$ ulimit -l | | 65536 | | [vagrant@centos7 ~]$ ulimit -l | | 64 | | vagrant@opensuse15:~> ulimit -l | | 64 | | ``` | | | | If you can't memlock enough memory to fit your instructions and maps, you'll | | get rejected with cryptic verifier error messages. | | | | -...- Solution -...- | | | | Calculate the amount of memory your programs and maps will need, and check the | | memlock limits on your target systems. You may be fine, or you may need to | | raise it first. Some libraries (like the one we wrote![0]) can try to take care | | of this for you. | +- | | +-**- Tail calls aren't guaranteed. -**- | | | | -...- Problem -...- | | | | Think of tail calls like the BPF equivalent of `execve`, except less powerful. | | It'll start running a new probe, with the original context argument, and | | replace everything you were previously doing. You can't provide it with custom | | arguments, and the tail call needs to pass the verifier independently. This | | means if you want to communicate between tail calls you'll need to use maps. | | You're also limited to chaining 33 tail calls in a single execution, after that | | the tail call execution will simply fall through. | | | | You can't call into another program with a tail call directly, either. You need | | to reference an index in a tail call program map (a type of BPF map) which | | needs to be set from userspace. For example, if you want to tail call from | | `prog_a()` into `prog_b()`, you'll need to load `prog_a()` and `prog_b()` | | first. At this point if `prog_a()` fires, the tail call into `prog_b()` will | | fizzle. Then, from userspace, you need to update a map to say "`prog_b()` is at | | index 5, if anyone tries to tail call into index 5, send them to `prog_b()`." | | Tracking and maintaining all these indexes can be cumbersome. | | | | And there's not always a guarantee that the tail call will fire. You could | | reach an execution limit, or a memory limit, or some other weird verifier edge | | case that prevents the tail call from firing. Your programs need to handle this | | gracefully. | | | | -...- Solution -...- | | | | First, you'll have to write your tail calls as though they were independent | | programs. Think of designing each one to grab a different bit of information | | you're looking for. If you find yourself re-calculating the same things in each | | program or otherwise need to communicate across calls, store and retrieve | | information from a map. | | | | For managing indexes, use an enum for your tail calls and reference that from | | your userspace application. For example, have an `enum tail_calls { PROBE_A, | | PROBE_B };` and then reference it from inside your programs and when loading | | the program map from userspace. The file descriptor for `probe_a()` goes at | | index `PROBE_A`, and so on. If you want to call into `probe_a()`, you get there | | by asking for `PROBE_A` with `bpf_tail_call(ctx, &tail_call_table, PROBE_A);`. | | | | Your program should also have a plan for what happens if the tail call doesn't | | go off. Do you want to send up an error? Ignore it? Send up a message that | | execution was completed? Something else? For example, if you're using recursive | | tail calls to read a string value you may want to return a message that says | | you hit your tail call limit before you finished reading the string. | +- | | +-**- You can't just return what you want. -**- | | | | Alternatively, you're on your own with error handling. | | | | -...- Problem -...- | | | | This one was a problem that we didn't even realize we had because it was so | | subtle. In C it's pretty customary to return `0` on success and `-1` (or some | | other negative error code) in the event of a failure. The actual returned value | | is usually written to a buffer or some other pointer given as a function | | parameter. You check the return code for success or failure and take actions | | appropriately (in theory). While writing BPF programs in C, especially kprobes, | | you might be tempted to follow this pattern. After all, it makes sense. The | | actual value you return is sent out through perf or written into a map so | | userspace can grab it, so the return value of the probe itself should be `0` to | | indicate success or `-1` to indicate failure, right? Just like every other C | | program? Wrong! For program types other than kprobes (remember, BPF is just an | | execution environment) it's more obvious that the return codes have special | | meaning. For example, XDP programs have explicit return codes to drop, pass, | | or re-process packets. | | | | For kprobes, `return 0;` means "I'm done with this kprobe, you can move on." | | You indicate that you want to keep the probe hanging around with _literally any | | other return code_ (including `return -1;`). That's probably not what you | | want. Take a look at this function from the kprobe handler in [18]: | | | | ```c | | /* Kprobe profile handler */ | | static int | | kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs) | | { | | // ... | | if (bpf_prog_array_valid(call)) { | | // ... | | ret = trace_call_bpf(call, regs); | | // ... | | if (!ret) | | return 0; | | } | | | | head = this_cpu_ptr(call->perf_events); | | if (hlist_empty(head)) | | return 0; | | | | dsize = __get_data_size(&tk->tp, regs); | | __size = sizeof(*entry) + tk->tp.size + dsize; | | size = ALIGN(__size + sizeof(u32), sizeof(u64)); | | size -= sizeof(u32); | | | | entry = perf_trace_buf_alloc(size, NULL, &rctx); | | if (!entry) | | return 0; | | | | entry->ip = (unsigned long)tk->rp.kp.addr; | | memset(&entry[1], 0, dsize); | | store_trace_args(&entry[1], &tk->tp, regs, sizeof(*entry), dsize); | | perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs, | | head, NULL); | | return 0; | | } | | ``` | | | | There's two things you should notice in that snippet. The line `ret = | | trace_call_bpf(call, regs);` and `if (!ret) return 0;`. That means if | | `trace_call_bpf()` returns _anything but `0`_ (including `-1`) we go through | | the remainder of the function, buffer allocation, trace argument storage, and | | so on. We can grab the internals of that function at [19]: | | | | | | ```c | | /** | | * trace_call_bpf - invoke BPF program | | * @call: tracepoint event | | * @ctx: opaque context pointer | | * | | * kprobe handlers execute BPF programs via this helper. | | * Can be used from static tracepoints in the future. | | * | | * Return: BPF programs always return an integer which is interpreted by | | * kprobe handler as: | | * 0 - return from kprobe (event is filtered out) | | * 1 - store kprobe event into ring buffer | | * Other values are reserved and currently alias to 1 | | */ | | unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) | | { | | unsigned int ret; | | | | // ... | | | | /* | | * ... | | */ | | ret = BPF_PROG_RUN_ARRAY(call->prog_array, ctx, bpf_prog_run); | | | | // ... | | | | return ret; | | } | | ``` | | | | As you can see, this is the function that actually invokes the kprobe. It gets | | `ret`, which it returns, from `BPF_PROG_RUN_ARRAY()` which, as you might | | expect, runs the BPF program. The documentation on this function is also pretty | | explicit, which is nice. When we return `0`, we've returned from the kprobe and | | don't need to keep any details about it hanging around. When we return `1` | | (which anything besides `0` aliases to) we store information about the kprobe | | in a ringbuffer for later. | | | | -...- Solution -...- | | | | The solution here is to always `return 0;` in your kprobes, unless you have an | | explicit need to `return 1;`. If you want to know if your kprobe failed or is | | in some incomplete state, you'll need to architect your message-passing to | | handle that case. For example, you might want to include a success code flag in | | the struct(s) you pass through a perfmap which you can check for failure. Or | | you might want to build your system around a best-effort event reconstruction | | for more complicated returns involving multiple messages. In any case, you'll | | have to engineer your error checking and handling independently of the BPF VM | | system. Those return codes are reserved, you gotta make your own. | +- +- +-#- Wow, that looks hard. Can you summarize it for me? -#- | | Certainly! BPF is really good at getting visibility (almost) anywhere in the | entire system, it allows dynamic reinstrumentation, can be made container | aware, is faster than alternatives, and is (usually) safe to run as long as you | can load it. Consider using BPF if any of the following apply to you: | | - You have the in-house resources and expertise to build and maintain a | long-lived BPF telemetry source. | | - You only want to use BPF for live debugging or other short-lived, one off, use | cases such as bpftrace. | | - You're lucky enough to only have to support a single kernel version. | | - You don't really care if the project succeeds, you just want to get experience | with BPF (this might legitimately apply in R&D orgs). | | If you don't have the resources and need to support a wide range of kernels, | you might be better off looking for an alternative (there are many free and | open source options thanks to GPL), or paying someone else to build it for you. | | Long running BPF programs for security are a relatively new use case. The | tooling around this use case is getting better all the time, but there's still | a lot to consider before diving in. +- +------[references]--------------------------------------------------------------------------+ | [0]: [https://github.com/redcanaryco/oxidebpf] | | [1]: Will go public Soon(TM) at | | [https://github.com/redcanaryco/linux-kernel-cloud-component-builder] | | [2]: [https://github.com/redcanaryco/redcanary-ebpf-sensor] | | [4]: [https://lwn.net/Articles/870269/] | | [5]: [https://lwn.net/Articles/812503/] | | [6]: [https://elixir.bootlin.com/linux/latest/source/kernel/trace/bpf_trace.c#L646] | | [7]: [https://www.brendangregg.com/bpf-performance-tools-book.html] | | [8]: [https://github.com/vbpf/ebpf-verifier] | | [9]: [https://github.com/iovisor/bcc/issues] | | [10]: [https://en.wikipedia.org/wiki/0x10c] | | [11]: [https://lwn.net/Articles/752047/] | | [12]: [https://github.com/redcanaryco/redcanary-ebpf-sensor/blob/main/src/programs.c#L393] | | [13]: [https://github.com/brendangregg/FlameGraph/] | | [14]: [https://github.com/kdlucas/byte-unixbench] | | [15]: [https://linux.die.net/man/5/limits.conf] | | [16]: [https://github.com/aya-rs/aya] | | [17]: [https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=BPF] | | [18]: [https://elixir.bootlin.com/linux/latest/source/kernel/trace/trace_kprobe.c#L1568] | | [19]: [https://elixir.bootlin.com/linux/latest/source/kernel/trace/bpf_trace.c#L95] | +--------------------------------------------------------------------------------------------+