┌───────────────────────┐ ▄▄▄▄▄ ▄▄▄▄▄ ▄▄▄▄▄ │ │ █ █ █ █ █ █ │ │ █ █ █ █ █▀▀▀▀ │ │ █ █ █ █ ▄ │ │ ▄▄▄▄▄ │ │ █ █ │ │ █ █ │ │ █▄▄▄█ │ │ ▄ ▄ │ │ █ █ │ │ █ █ │ │ █▄▄▄█ │ │ ▄▄▄▄▄ │ │ █ │ Mutable Programs - A guide to simple polymorphism │ █ │ ~ gorplop └───────────────────█ ──┘ Wouldn't it be kewl to write a program which creates its copy, but with different code? Right? Read on... .------------------, / Mutable Programs .-------------------------------. '-------------------\ A guide to simple polymorphism \ '--------------------------------' When we talk about self-modifying programs, we usually think about software which modifies it's own code during execution. For example, a programmer may choose to obfuscate a piece of code by removing a jump instruction, or placing some dummy instruction, and then when the program is run, an earlier piece of code does some operations to undo these changes. In modern operating systems there are some obstacles against modifying the executable memory regions directly, but there are ways around it. In this text I would like to explore the other kind of self-modifying code. Instead of programs which manipulate their own copies in memory, I would like to discuss programs which are prepared in such a way, that they can be easily (programatically) mutated to create a new copy, with different file contents, that does the same thing when executed. This in itself is not a virus, but such code can be used to create one. The code presented in this paper is harmless and the implementation simplified, to show the general idea. It is however functional and executable. .-------------------. \ Polymorphic virii \ '-------------------' Classic polymorphism is best described in "Win32 Polymorphism" by Billy Belcebu/iKX [belcebu00]. The main goal of the technique is to evade detection by removing constant byte streams in the virus code, as an answer to pattern matching ("scanstring") techniques of contemporary AV software. At that point Win32 and DOS virii were encrypted, each copy with a different key to minimize the recognizable data size. The part which was not changing between the virus copies was the decryptor code. To solve that, some virii started transforming the decryptor code as well [watson92]. The new technique was called polymorphism. Polymorphism is rewriting the virus code, so that the underlying instructions are different while the code does the same thing. A number of techniques were developed and the reader is advised to read that work, even if it is specific to x86 or not directly related to writing UNIX malware. In this paper I will be focusing on instruction level polymorphy (level 3 and 4 according to iKX). There are a few techniques which can be generalized to any architecture: 1. Replacing a single instruction with several which do the same thing 2. Inserting dummy instructions (nop, add/sub zero from a register, etc.) 3. Changing the order of the instructions without changing the high level program flow. Let's abstract from decryptor codes and loaders and all that stuff, instead, focusing just on the polymorphy part. Let's consider what would be necessary for a program to apply these techniques to another piece of code, to produce a new program which does the same thing, but is encoded differently. Then we can apply the program to itself to create a polymorphic program. First, the program needs to be able to distinguish the individual instructions from the byte stream. This is a hard task on x86, if not impossible in the general sense. The x86 architecture is a variable instruction length architecture and to make things harder there are many ways to encode the same instruction. The next requirements depend on the particular technique. For 1. and 2., the program needs to keep track of how many new bytes were inserted (or removed) and then correct all forward and backward references (jumps, loads, etc.). This is a very difficult task to do on binaries, because you need to recreate all the information lost during compilation. This means that at least in part, the mutation engine must have some functionality of an analyzing disassembler and track the references, then relink the code fragments correcting all references. Technique 3 is also difficult - assembly programming is by definition imperative, so in general it is not possible to rearrange the instructions and get the same high-level program flow. In some cases it can be done, as long as we do not break the dependency chains. It is done in hardware when possible, in processors with out-of-order execution. The problem is so complex, that we had at least two major security holes in the last years in a similar mechanism (speculative execution) on x86 alone. The following examples illustrate the idea and show when the instructions can be swapped: ;; Example 1 ;; call open(path, flags, mode), x86_64 mov rax, 2 lea rdi, path mov rsi, flags mov r10, mode syscall ;; the first 4 instructions can be executed in any order, or even at ;; the same time, if memory allows. ;; Example 2 ;; on some unspecified RISC architecture load rA, loc1 addi rA, rA, 2 load rB, [loc2+rA] add rA, rA, rB stor loc3, rA ;; these instructions cannot be reordered nor executed in parallel ;; because of dependency chains ;; nb. such code will be very slow on pipelined architectures. In the first example the 4 instructions before the syscall do not depend on each other. In the second example, every instruction depends on the result of the previous one, so you cannot reorder them. Changing the order of these instructions will result in a different program that will most likely crash. The dependencies between registers are explicit, as source and destination registers are written in the assembly code. In this case, rA is the destination register of the first load, and the source register of the first addi. rA value is also used in the source of the second load. Because of this, the instructions need to be executed in the order they are written in, because each one depends on the result of the previous. A true mutation engine also needs to keep track of the implicit dependencies between instructions. Implicit dependencies are the ones which result from the instructions themselves. Branches are one type of instructions with implicit dependencies. Another example of those are calls, all stack instructions, all test instructions, and all flag modification instructions. In x86 many general purpose instructions, like div, mul, loop, string instructions (stos, scas, lods, cmps) use an implicit register. If you are coding a polymorphism engine for x86, you need to take care of that as well. A good reference for these is in [386prm] chapter 3. In general, the program would need to do a lot of complicated things to implement even the simplest polymorphic techniques. It would need to understand the instruction encoding to divide the bytes between instructions. It would also need to analyze the dependency chains between instructions (and decode the instructions) to know which can be reordered and which cannot. And finally, not shown here (but it is a real problem), it would need to correct any references if the architecture uses PC-relative (program counter) addressing (because if you move the instruction, the PC value at that instruction changes, of course). .---------------------. / Hacking the problem / '---------------------' How can we simplify this problem and write a polymorphic program? In the previous paragraph I outlined 3 problems that need to be solved to implement a polymorphic virus engine. These are the following: 1. Dividing the byte stream into instructions 2. Modifying the code without breaking it 3. Fixing references in the code after changes are made First off, the problem of dividing instructions. This is simple. Throw x86 where it belongs, to the computer history museum, and use an architecture with a constant instruction length. There are many - in this text I chose 64-bit ARM, a.k.a. aarch64, because of how ubiquitous it is becoming. In aarch64 all instructions are 4-byte long and aligned on a 4 byte boundary, so the problem of dividing a stream of bytes into instructions is non-existent. I am of course kidding about x86, there are ways around variable instruction length. One of these would be to store a list of instruction lengths (precompute it) and then make the mutation engine update the list as it swaps or replaces the instructions. But let's focus on the essence of the problem, and choose aarch64 as the target. The second problem can be waved away too. If the source program is written in assembly, you can write it in such a way that a simple transform (like swapping even and odd instructions) never breaks the dependency chains. This is similar to techniques known by some experienced assembly programmers, like those writing for machines with delay slots. While writing the assembly, just make sure you can swap any pair without breaking the program flow. Let's call such code easily-mutable, that is, code prepared in a way can be changed with a simple transformation, and still work. .--------------. \ Malware in C \ '--------------' You might ask, "Gorplop, but what if I would like to write malware in C instead?". That is also possible. The C compiler can generate an assembly listing (for gcc, it's -S), and then we can edit and assemble the output. Of course, when I want to program in C, I do not want to correct the generated assembly every time I compile the program. It is a good idea to write a tool which takes the resulting assembly and transforms it in a way that guarantees code mutability. Because this tool will be a part of the build system, and not the target program, the size and complexity requirements are not that strict like for the program itself. This leaves us with the third problem. PC-relative addressing. In assembly, defining read-only data the usual way will create a constant pool near the code. A constant pool is what it sounds like. It is a memory area where constant values are stored. The machine code then uses PC-relative addressing instructions (like ldr) to load the constant to a register. Let's see how it looks in practice. A simple aarch64 assembly program like this: .text .global _start _start: // syscall 64 = write(unsigned int fd, const char *buf, size_t count) mov x0, #1 ; fd ldr x1, =hello ; buf ldr x2, =hello_len ; count movz x8, #64 svc #0 //do the syscall // syscall 93 = exit(int error_code) mov x0, #0 movz x1, #1 mov x8, #93 svc #0 .data hello: .ascii "Hello, aarch64 world!\n" hello_len = .-hello will disassemble to this: $ aarch64-linux-gnu-objdump -d 00_hello.elf 00000000004000b0 <_start>: 4000b0: d2800020 mov x0, #0x1 // #1 4000b4: 58000121 ldr x1, 4000d8 <_start+0x28> // (*) 4000b8: 58000142 ldr x2, 4000e0 <_start+0x30> // (*) 4000bc: d2800808 mov x8, #0x40 // #64 4000c0: d4000001 svc #0x0 4000c4: d2800000 mov x0, #0x0 // #0 4000c8: d2800021 mov x1, #0x1 // #1 4000cc: d2800ba8 mov x8, #0x5d // #93 4000d0: d4000001 svc #0x0 4000d4: 00000000 udf #0 4000d8: 004100e8 .word 0x004100e8 // pointer to .data 4000dc: 00000000 .word 0x00000000 4000e0: 00000016 .word 0x00000016 // constant, 0x16 4000e4: 00000000 .word 0x00000000 The ".data" directive in the assembly code does not create a .data section, instead it makes a small constant pool. If the exit syscall was not there, the CPU would run into that constant pool and start executing the data. The two ldr instructions marked with a (*) use PC-relative addressing to load a value into a register. The register number is encoded in the lowest 5 bits, the next 19 bits encode the immediate value: aarch64: LDR: Load register (literal) - instruction encoding 32 24 23 5 4 0 0 x 011000 kkkkkkkkkkkkkkkkkkk rrrrr x = 0 (32 bit) or 1 (64 bit) k = immediate[19:2] is extended by appending 00 to the right end r = register In the code above we have 58000121: 0x58 (ldr 64bit opcode), immediate +9, to register 1 58000142: 0x58 (ldr 64bit opcode), immediate +10, to register 2 To get the target address, the immediate is added to Program Counter (PC). The first ldr is at 4000b4, 0x4000b4 + 4*9 = 0x4000d8 which is the pointer to our string. Similarly, the second ldr points to 10 words later, which is where the string length of 0x16 lives. On x86 the same technique is called RIP-relative addressing, because the program counter is called Instruction Pointer (register rip or eip) on x86. Notice also how the data section is aligned on an 8-byte boundary and an empty "udf #0" instruction is shown by objdump at 4000d4. How does objdump know the constant pool starts one instruction later and disassembles it as data instead of instructions? This will be explained at the end of the text, so stay tuned! On the other hand, programs generated from C code will use data section for constants, so no PC-relative addressing is used for fetching constants (unless PIC). However, there is another important bunch of instructions that uses PC-relative addressing, these are jumps (or branches, as ARM architecture calls them). It is possible to write branchless code. Conditional instructions exist on ARM (ITTT instruction), and x86 has cmov extension, used by movfuscator, but compilers usually emit branch instructions. This means that our mutation engine needs to know where the branch instructions are. It can detect them, by implementing a very simple analyzing disassembler (this theme will repeat a few times in the text). Another way is to, again, store a precomputed table of these, which will then have to be updated as instructions get modified. In case of aarch64 the first way seems like a smaller and simpler solution. .----------------. / Implementation / '----------------' The implementation needs to solve the three problems we earlier identified. The first one we waived away for now, by limiting ourselves to constant instruction length architectures. This leaves us with RISC: ARM, MIPS, RISC-V, SPARC to name a few in current use. The second problem was moved to the toolchain. Our program will deal only with mutable binaries, that is, binaries prepared so that pairs of instructions can be swapped. This is done either by carefully writing the assembly, or automatically patching the assembly generated from C to fulfill this requirement. I want us to stop here for a moment. Let's think about how we can make sure the input program will be mutable by a simple pair swapping algorithm. There are a few ways to do this. We can move part of the complexity to the toolchain and write what essentially is a special analyzing disassembler. This program would need to parse the generated assembly (or disassemble an existing ELF file), trace the explicit and implicit dependencies, then insert dummy instructions in places where swapping instruction pair would break the dependencies. To keep the reader busy, this example will be shown in x86_64, but again, this works for any architecture. // if((fd = open(file_path, O_RDWR | O_CREAT, S_IRWXU)) == -1) // goto error; mov edx, 0x1c0 ; mode_t mode mov esi, 0x42 ; int flags lea rdi, file_path ; char *pathname call open ; open(char* pathname, int flags, mode_t mode) cmp eax, -1 je error Here the first three instructions can be executed in any order. The next instructions cannot. The program which transforms this assembly sequence into a mutable program must detect these dependencies remove them. One of the possible outputs of such program is shown below: mov edx, 0x1c0 ; even mov esi, 0x42 ; odd - can swap, no change lea rdi, file_path ; even add rdi, 0 ; odd - inserted dummy call open ; even sub eax, 0 ; odd - inserted dummy cmp eax, -1 ; even jge .+2 ; odd - dummy, but cannot affect flags for jump je error ; even xchg rcx, rcx ; odd - inserted dummy Inserting dummy instructions is only one of the answers to this problem. Some other ways to do this are described in other polymorphic papers (see at the end of the text). The last problem is easy to solve. We just have to find the PC-relative instructions. Then we can adjust the offset in the instruction. If the instruction was odd, it will now be even, so we add 1 word to the offset. If the instruction was even, it will now be odd, so we subtract 1 word from the offset. If you noticed that the destination instruction will also be a pair that may or may not be swapped, bear in mind that one of the instructions will be a no-operation, so we will either jump into the target instruction, or the empty instruction before the target. ;; Branch is adjusted by +1 word 0x4080 dummy .-- 0x4080 bl .+0x14 .-- 0x4084 bl .+0x10 : 0x4084 dummy : ... : '-> 0x4094 insn '-> 0x4094 insn or dummy 0x4098 dummy 0x4098 dummy or insn ;; If using dummy no-op instructions, swapping a pair with a branch only ;; requires to adjust the branch offset. The jump will reach the target ;; instruction in either case .----------------. \ The nopstuffer \ '----------------' mut1 source package comes with a script called nopstuffer, which is used to convert an assembly listing to an easily-mutable program. This version of the script is extremely simple, to not say naive. All it does is insert a nop instruction between every instruction in the source assembly listing. The only use for it is for testing and converting gcc assembly output to an easily mutable ("stuffed with nops") program. Despite that, mut1 with nopstuffer can create and mutate working executables from C or assembly code. The code will look obviously suspicious to anyone with a disassembler, and it makes the text size grow twice. But this is a base on which a more complicated, analyzing nopstuffer can be written on. We are doing this in the build stage, so there are no code size requirements. We could analyze the assembly, trace the dependencies, and replace the instructions or insert empty ones accordingly. Let's say improving the nopstuffer is left as an exercise for the adept. There is more than one way to do this. Show me something good! .-------------------------------. / Putting it all together: mut1 / '-------------------------------' The implemented program, mut1, takes a mutable source ELF file and outputs a mutated ELF file in which the entire text section has even and odd instructions swapped with a random probability. This creates a lot of possible combinations of bytestreams that all result in the same program. The tools are released with this zine, look for the github link at the end of this file. By using nopstuffer you can write the code in C as well. The toolchain starts with a C file, which is compiled to an assembly listing. Then, nopstuffer is ran to "stuff" the listing with empty instructions between each real one. The resulting file is compiled to an ELF called "stuffed". The stuffed ELF is a regular executable and it should work the same as the original program. You can also use mut1 on executables assembled from any assembly listing which satisfies the dependency requirements we described earlier. The next step is to run the mutation program, mut1. mut1 loads the stuffed ELF, parses it and finds the text section. The parser is very simple and will break bad if you feed it an unusual ELF file. Then, mutate_xrefs is called on the text section. This function is the heart of the mutation engine. The simplified main() looks like this: int fd = open(filename, O_RDONLY); uint8_t* elf = mmap(fd, ...); mutate_xrefs(elf); // mutfile = filename + ".mut" int mutfile_fd = open(mutfile, O_RDWR | O_CREAT, 0600); write(mutfile_fd, elf, file_stat.st_size); mutate_xrefs iterates over the text section, swapping each instruction pair with a random probability. As mentioned before, it implements a very minimal piece of logic which could be called an analyzing disassembler. The swapped instructions are compared against common compiler-emitted PC-relative instructions and the encoded offsets are swapped accordingly. Below is a code fragment that deals with branches: #define B_MASK 0xfc000000 #define BL_OPCODE 0x94000000 #define B_OPCODE 0x14000000 // inside loop that iterates over text section of the ELF // bool swap - true if the pair is to be swapped (random) // bool second - true if the instruction is second in pair uint32_t insn = code[i]; // Detect and adjust branch instructions if((insn & B_MASK) == BL_OPCODE || (insn & B_MASK) == B_OPCODE) { offset = (insn & (~B_MASK)) << 2; DEBUG(" b/bl imm26: ofs=%08x ", offset); if(swap) { // patch the instruction offset += second ? 4 : -4; DEBUG("new ofs=%08x", offset); // encode new offset insn &= B_MASK; insn |= offset >> 2; DEBUG(" in=%08x", insn); } } code[i] = insn; If the instruction pair is to be swapped, the code adjusts the 26-bit immediate value in the instruction. The offset is adjusted plus or minus 1 machine word (4 bytes). ARM often encodes the immediates without the lower 2 bits (code is always 4-byte aligned), which is why the offset is shifted right by 2 before being encoded back in the instruction. Other instructions are dealt with in a similar way. The resulting ELF file gets written to disk with a ".mut" extension appended. This file is executable and should work the same way as the original program, despite having different code. The checksum of the .text section and of the entire file is different. If the swap bool is hardcoded to true, then running mut1 on the mutated code again restores the original executable. The target architecture of the program is aarch64, but my development machine is still x86. Yours probably too. Because of that, the programs are cross-compiled, and the source package that goes along with this text contains a script for setting up a Debian 12 aarch64 virtual machine in QEMU. This is what I was using during development. You can test must1 by compiling the examples and then copying them over scp into the aarch64 VM, or setting up cross-platform emulation around qemu-user-static. .---------------------------. \ Summary and afterthoughts \ '---------------------------' The adept is advised to study other polymorphism papers [belcebu00, watson92, flush99], keeping in mind that the ideas in them (especially older ones) are based on x86 Win32 virii, but they can be extended to any other architecture. RISC load-store architectures are simpler to write polymorphic malware for because of their design. There are little implicit dependencies, unlike on x86 or other CISC architectures. The verbosity of RISC assemblers should allow for easier dependency tracing without having to use complicated code analysis engines. In this group, RISC-V deserves a special shoutout, because it has a very simple instruction encoding scheme, with only 6 base instruction formats. This simplifies parsing and analysis, which should help writing a compact polymorphic engine. While writing this tool, at first I relied on detecting constant pools using the hidden symbols "$d" and "$x" left by GNU Assembler (you can see them by passing --show-all-symbols to objdump). Initially this worked and relying on these symbols let me process the text section almost statelessly. It however broke for assembly emitted from GCC and I had to implement the linked list to store used constant pool locations. These hidden ELF symbols are also the answer to how objdump can distinguish constant pools from instructions. The example shown before, this time with --show-all-symbols, looks like this: $ aarch64-linux-gnu-objdump -d 00_hello.elf --show-all-symbols 00_hello.elf: file format elf64-littleaarch64 Disassembly of section .text: 00000000004000b0 <_start>: 4000b0: d2800020 mov x0, #0x1 // #1 4000b4: 58000121 ldr x1, 4000d8 <_start+0x28> 4000b8: 58000142 ldr x2, 4000e0 <_start+0x30> 4000bc: d2800808 mov x8, #0x40 // #64 4000c0: d4000001 svc #0x0 4000c4: d2800000 mov x0, #0x0 // #0 4000c8: d2800021 mov x1, #0x1 // #1 4000cc: d2800ba8 mov x8, #0x5d // #93 4000d0: d4000001 svc #0x0 4000d4: 00000000 udf #0 00000000004000d8 <$d>: 4000d8: 004100e8 .word 0x004100e8 4000dc: 00000000 .word 0x00000000 4000e0: 00000016 .word 0x00000016 4000e4: 00000000 .word 0x00000000 More experimenting with the assembly source file will reveal that code segment is marked with an $x symbol. This symbol exists only in GNU Assembler and not in code emitted by GCC. The current version of mut1 does not use these symbols, but they might prove useful in some other project. As a closing thought, I would like to tell you that writing malware is like walking up a large mountain. There are many paths to take and not all lead to the top. Sometimes one path contains an obstacle which is too hard to go through. Taking another path to find out it does not lead anywhere might give you just enough experience to come back and pass the first one. It is a good idea to try various things and be wary of fixating on one specific solution or way of doing things. Stay cool, write code and hack all the things. Congratulations for making it through this text. I hope you liked it. If you have any q's or just want to say hi, email me at gorplop@sdf.org The mut1 code and other tools are available on github, at https://github.com/gorplop/mut1 Greetz to: netspooky, smelly of vxug, deluks, all tmp.0ut staff and all those before me. .--------------. > References < '--------------' belcebu00: Section 12 in "Billy Belcebu Virus Writing Guide 1.00 for Win32", 29A4#4 watson92: "A Discussion of Polymorphism", Gary Watson, https://cryptohub.nl/zines/vxheavens/lib/agw00.html flush99: "Other techniques of polymorphism", flush/MGL, Asterix #2 http://gbppr.net/cracking/vtc/vdat/epothpol.htm 386prm: Intel 80386 programmer's reference manual, https://css.csail.mit.edu/6.858/2014/readings/i386.pdf --[ PREV | HOME | NEXT ]--