┌───────────────────────┐ ▄▄▄▄▄ ▄▄▄▄▄ ▄▄▄▄▄ │ │ █ █ █ █ █ █ │ │ █ █ █ █ █▀▀▀▀ │ │ █ █ █ █ ▄ │ │ ▄▄▄▄▄ │ │ █ █ │ │ █ █ │ │ █▄▄▄█ │ │ ▄ ▄ │ │ █ █ │ │ █ █ │ │ █▄▄▄█ │ │ ▄▄▄▄▄ │ Cramming a Tiny Program into a Tiny ELF File: │ █ │ A Case Study │ █ │ ~ lm978 └───────────────────█ ──┘ -- Introduction ---------------------------------------------------------------- Last fall, I was rereading Brian Raiter's article on tiny ELF programs [1], when I decided to try the challenge myself and create the tiniest x86_64 ELF program on Linux; I was unaware at the time of others' attempts to answer the question. To avoid missing any edge cases that could help, I decided to work from first principles, reading the source of the Linux kernel's ELF loader and deriving the requirements from that. At first, after ruling out all smaller offsets of the program header, I concluded that offset 0x18 was the smallest possible, yielding an 80-byte program that returns 42: 00000000: 7f45 4c46 b0e7 40b7 2a0f 0500 0000 0000 .ELF..@.*....... 00000010: 0200 3e00 0000 0000 0100 0000 0100 0000 ..>............. 00000020: 1800 0000 0000 0000 1800 0000 0100 0000 ................ 00000030: 0000 0000 0000 3800 0100 0000 0000 0000 ......8......... 00000040: 0100 0000 0000 0000 0000 0000 0000 0000 ................ (This 80-byte size was already known by Brian Raiter and a few others [2], though not widely publicized.) This summer, I found Nathan Otterness's 2021 article [3] concluding that 105 bytes is the minimum size, and to put the issue to rest, I decided to write my own article with a full analysis of the 80-byte minimum. However, in the process of repeating my analysis, I remembered that I had only considered executables of type ET_EXEC, and not ET_DYN. As it turned out, this was a mistake: with ET_DYN, the program header can just barely be placed at offset 0xc. Annoyingly, the entry point is fixed to the 'jg .+0x47' at the very start of the file, so another jump has to be placed at offset 0x47, past the end of the headers. Overall, this yields a 73-byte program that returns 42: 00000000: 7f45 4c46 b0e7 40b7 2a0f 0500 0100 0000 .ELF..@.*....... 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 0000 0000 0000 3800 ..............8. 00000030: 0100 0000 0000 3800 0100 0000 0000 0000 ......8......... 00000040: 0000 0000 0000 00eb bb ......... Unless I made another mistake, this is the absolute minimum size for an ELF file containing an x86_64 program, modulo the precise definition of a "program". I plan to publish the full analysis later this year. -- A Simple Challenge, and a Straightforward Solution -------------------------- But once I had established the minimum size, I was inspired by Otterness's article to turn it into the tiniest x86_64 Hello World program. The natural rules for this challenge are pretty simple: write the 14 bytes "Hello, world!\n" to stdout, exit with code 0, and produce no other side effects. Of course, the actual process of golfing the program is far from simple. First, let's start with our 80-byte pattern, which doesn't include a mandatory instruction at offset 0x47. This allows us to put our entire 14-byte string at offset 0x48, immediately following phdr->p_memsz. To place our instructions so that they won't interfere with the header fields, we can use a template that I created in my analysis: 00000000: 7f45 4c46 **** **** **** **** **** **** .ELF............ 00000010: 0200 3e00 **** **** 0100 0000 pppp pp00 ..>............. 00000020: 1800 0000 0000 0000 1800 0000 pppp pp00 ................ 00000030: **** **** **** 3800 0100 qqqq qqqq qq00 ......8......... 00000040: 0100 qqqq qqqq qq00 **** **** **** **** ................ Here, the p, q, and * characters are all placeholders. There are a couple extra restrictions on their values for the program to be loadable. 0xpppppp must be odd, so that phdr->p_flags includes PF_X. Also, if the system doesn't support 5-level paging, then 0xpppppp can't be larger than 0x007fff, and 0xqqqqqqqqqq can't be larger than 0x007fffffff. The template has enough space for us to write our program with a straightforward instruction sequence, with no special tricks necessary, apart from using 'pop rdi' to set rdi to 1. The value comes from argc, as saved onto the initial process stack by the kernel. Adapting this sequence to work when argc != 1 is left as an exercise to the reader. 0x1: rex.RB rex.WR rex.RX /* "ELF" in ELFMAG */ mov al, 1 /* nr = 1 = __NR_write */ pop rdi /* fd = argc = 1 */ lea rsi, [rip+0x3a] /* buf = "Hello, world!\n" */ jmp 0x14 0x14: mov dl, 14 /* count = 14 */ jmp 0x30 0x30: syscall /* write(1, "Hello, world!\n", 14) */ mov al, 231 /* nr = 231 = __NR_exit_group */ jmp 0x3a 0x3a: xor edi, edi /* error_code = 0 */ syscall /* exit_group(0) */ 0x48: .ascii "Hello, world!\n" This yields an 86-byte Hello World program: 00000000: 7f45 4c46 b001 5f48 8d35 3a00 0000 eb04 .ELF.._H.5:..... 00000010: 0300 3e00 b20e eb18 0100 0000 0100 0000 ..>............. 00000020: 1800 0000 0000 0000 1800 0000 0100 0000 ................ 00000030: 0f05 b0e7 eb04 3800 0100 31ff 0f05 0000 ......8...1..... 00000040: 0100 31ff 0f05 0000 4865 6c6c 6f2c 2077 ..1.....Hello, w 00000050: 6f72 6c64 210a orld!. Clearly, this is the tiniest program we can write while keeping the string in one piece: the template doesn't let us put it any earlier in the file. Even our offset-0xc pattern doesn't help: there are only 11 bytes of free space before the mandatory jump instruction, so the 14-byte string would have to be placed after it, at offset 0x49. If we want to make the program any tinier, we'll have to divide the string into multiple chunks in the file, then reassemble them at runtime. -- Division and Reassembly ----------------------------------------------------- When we divide the string, it makes sense to use the offset-0xc pattern instead of the longer offset-0x18 pattern. The template for this offset includes a long block of 28 immutable header bytes after the initial free space: 00000000: 7f45 4c46 **** **** **** **** 0100 0000 .ELF............ 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 **** **** pppp pppp ................ 00000030: pppp pp00 qqqq 3800 0100 rr00 **** **** ......8......... 00000040: **** **** .. In this case, we have phdr->p_filesz = 0x00pppppppppppppp and phdr->p_memsz = 0x00rr00010038qqqq, with three different alternatives for their values. Either 1. the file size <= phdr->p_filesz <= 0xfff; 2. phdr->p_filesz == phdr->p_memsz; or 3. phdr->p_filesz < phdr->p_memsz, and (phdr->p_filesz & 0xfff) == 0xff4. In practice, we'll be using the 3rd alternative, so that we can store additional instructions within phdr->p_filesz. Once again, if the system doesn't support 5-level paging, then the two values must both be at most 0x00007fffffffffff (or rather, a few megabytes less, to leave room for the process stack). Also, if the system *does* support 5-level paging, and phdr->p_memsz is above that size, then the vm.overcommit_memory sysctl must be set to 1 ("always overcommit"), to allow the kernel to map enough terabytes of read-write memory. When reassembling the string from two parts, we need to write it into a region of writable memory. This leaves us a choice: should we push it onto the stack, or should we write it into the program image, which is mapped as WX? The stack can be written to very easily, using the family of 'push' instructions. But the program image already contains one of the parts, leaving only the other part to be written. For now, let's start with an approach writing to the program image. We can actually still join the two parts with a 'push' instruction, if we first use a 'pop rsp' to replace the stack pointer with a pointer into the image. To obtain this pointer, we can replace a 'jmp rel8' with a 'call rel32', adding only 3 bytes to the instruction sequence, down from the 7 bytes required by a typical 'lea r64, [rip+disp32]'. Then, to minimize stack wrangling, we can place the 'call' directly before "orld!\n", so that 'pop rsp; push r64' is sufficient to join the two parts. Finally, prior to the 'call', we can use an ordinary 'mov r64, imm64' to load the first part "Hello, w" into a register. When fitting the instructions into the template, we must be careful to satisfy the 0xff4 requirement on phdr->p_filesz: its first byte must be 0xf4, and its second byte must be 0x0f, 0x1f, ..., or 0xff. Best utilizing these two bytes is one of our main goals in arranging the sequence. The first byte can be either skipped past with a dummy instruction like 'test al, 0xf4' (a8 f4), or used at the end of a 3-byte encoding of 'mov rsi, rsp' (48 8b f4). Meanwhile, the second byte can be used for 'syscall' (0f 05), 'pop rdi' (5f), or any of the multibyte encodings of 'pop r/m64' (8f ...) or 'push r/m64' (ff ...). Altogether, we have a somewhat twisty instruction sequence, starting with a 'jrcxz 0x3c' to jump to the first instruction of the trailing free space, and using both 'mov rsi, rsp' and 'pop rdi' to cover the 0xff4 in phdr->p_filesz. The resulting file takes 84 bytes. 0x0: jg 0x47 /* "\177E" in ELFMAG */ 0x47: jrcxz 0x3c 0x3c: pop rax /* nr = argc = 1 = __NR_write */ mov rcx, 0x77202c6f6c6c6548 /* rcx = 'Hello, w' */ jrcxz 0x3c /* (not taken) */ call 0x28 /* [rsp] = "orld!\n" */ .ascii "orld!\n" 0x28: pop rsp /* rsp = "orld!\n" */ push rax /* [rsp] = 1 */ mov rsi, rsp /* 48 8b f4 */ /* buf + 0x8 = "orld!\n" */ pop rdi /* 5f */ /* fd = 1 */ push rcx /* buf = "Hello, world!\n" */ mov dl, 14 /* len = 14 */ jmp 0x4 0x4: syscall /* write(1, "Hello, world!\n", 14) */ mov al, 231 /* nr = 231 = __NR_exit_group */ xor edi, edi /* error_code = 0 */ syscall /* exit_group(0) */ 00000000: 7f45 4c46 0f05 b0e7 31ff 0f05 0100 0000 .ELF....1....... 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 5c50 488b f45f 51b2 ........\PH.._Q. 00000030: 0eeb d100 0000 3800 0100 d200 5848 b948 ......8.....XH.H 00000040: 656c 6c6f 2c20 77e3 f3e8 daff ffff 6f72 ello, w.......or 00000050: 6c64 210a ld!. Since phdr->p_memsz = 0x00d2000100380000, this program requires 5-level paging to run. Luckily, even if you don't own a recent Intel Xeon processor that supports it natively, you can emulate 5-level paging with QEMU and a guest that enables CONFIG_X86_5LEVEL in its kernel. Personally, I use a premade Arch Linux VM image [4], since it is small and quick to boot. Emulating 5-level paging in QEMU just takes an extra argument to set the CPU model: $ qemu-system-x86_64 -cpu qemu64,la57 -m 512 Arch-Linux-x86_64-basic.qcow2 [arch@archlinux ~]$ sudo sysctl vm.overcommit_memory=1 vm.overcommit_memory = 1 [arch@archlinux ~]$ ./program; echo $? Hello, world! 0 At this point, our main bottleneck for improving the size is the big fat 5-byte 'call rel32' before the "orld!\n" part; it blocks us from moving that part any earlier in the file. One approach to avoid the call is to use two imm64 operands to load the two parts, but that would require placing the 'mov' instruction for the "Hello, w" part at offset 0x49 or later, and ultimately it wouldn't end up saving any bytes at all. If only we had some way to separate the strings from the instructions, so as to use fewer bytes in the trailing free space... -- The Almighty SYSCALL Instruction -------------------------------------------- The AMD64 psABI [5] has something curious to say about syscalls on Linux: A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11. This is pretty vague; what nefarious things could the kernel possibly be doing with those registers!? The answer lies not in the psABI, but in the entry for the SYSRET instruction in Intel's SDM (and its counterpart in AMD's manual): SYSRET is a companion instruction to the SYSCALL instruction. It returns from an OS system-call handler to user code at privilege level 3. It does so by loading RIP from RCX and loading RFLAGS from R11. This means that rcx is effectively guaranteed to contain the return address on return from a 'syscall' instruction. Thus, after running a syscall (even one that has no external side effects), we have a pointer into the program image, which we can use to push 8 bytes from anywhere in the file onto the stack, with only a 3-byte 'push [rcx+disp8]'. A no-op 'syscall' at the start of the instruction sequence costs us 2 extra bytes, but we can make up for it by using only 4 basic blocks instead of 5, saving a 2-byte jump instruction. We can use the conveniently-sized 8-byte space at offset 0x4 to store the "Hello, w" part, and leave the "orld!\n" part at the end of the file. Then, at runtime, we can push the two parts to the stack with a pair of 'push' instructions. Thus, by the power of the almighty SYSCALL instruction, we can make our code a bit less twisty, and thereby shrink our file from 84 to 81 bytes. 0x0: jg 0x47 /* "\177E" in ELFMAG */ 0x4: .ascii "Hello, w" 0x47: jrcxz 0x3c 0x3c: syscall /* read(0, NULL, 0) */ pop rax /* nr = argc = 1 = __NR_write */ push rax /* [rsp] = 1 */ pop rdi /* fd = 1 */ push [rcx+0xd] /* rsp = "orld!\n" */ push [rcx-0x3a] /* rsp = "Hello, world!\n" */ jrcxz 0x3c /* (not taken) */ jmp 0x28 .ascii "orld!\n" 0x28: mov dl, 14 /* len = 14 */ mov rsi, rsp /* 48 8b f4 */ /* buf = "Hello, world!\n" */ syscall /* 0f 05 */ /* write(1, "Hello, world!\n", 14) */ mov al, 231 /* nr = 231 = __NR_exit_group */ mov dil, 0x0 /* 40 b7 00 */ /* error_code = 0 */ syscall /* exit_group(0) */ 00000000: 7f45 4c46 4865 6c6c 6f2c 2077 0100 0000 .ELFHello, w.... 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 b20e 488b f40f 05b0 ..........H..... 00000030: e740 b700 0f05 3800 0100 b800 0f05 5850 .@....8.......XP 00000040: 5fff 710d ff71 c6e3 f3eb dd6f 726c 6421 _.q..q.....orld! 00000050: 0a . (Note that as our no-op syscall, we attempt to read the stdin stream for 0 bytes. Could this be considered a side effect? If stdin is redirected to a special file from some random driver or subsystem in the kernel, pretty much anything could happen when it receives a 0-byte read request from userspace. However, apart from FUSE, I am not aware of any driver that doesn't simply discard or reject such a read request, so I wouldn't consider this a serious issue.) -- Ouroboros Programming ------------------------------------------------------- ὁ <δὲ> κύκλος ἐπίπεδον σχῆμά ἐστιν ὑπὸ μιᾶς γραμμῆς περιεχόμενον, καὶ ταύτῃ κυκλικὸν ὀνομάζεται σχῆμα ἀφ' ἑαυτοῦ ἀρχόμενον καὶ εἰς ἑαυτὸν ἀναστρέφοντος καὶ μηδαμοῦ περατουμένου. ὅθεν καὶ Αἰγύπτιοι καθ' ἱερὸν λόγον δράκοντα οὐρηβόρον ταῖς πυραμίσιν ἐγγλύφουσιν. [6] Now, the circle is a plane figure bounded by a single line, and thus a shape that begins from itself and ends with itself is called "circular"--which is particularly [true of] time which returns into itself and is never terminated. Hence also the Egyptians, in accordance with a sacred discourse, carve a serpent eating its tail on their pyramids. [7] So says John Lydus (translated by Mischa Hooker) regarding the symbol of the ouroboros. Though our program is required to terminate, we can design it so that it returns into itself, as the ouroboros does. At this point, our instruction sequence both begins and ends with a 'syscall', and we are already forced to jump from the end to the start. Thus, we can close the circle, letting the instruction pointer advance once more into the initial 'syscall', so that we no longer need the final 'syscall'. This allows us to save 2 bytes. However, adapting the program in this way isn't entirely trivial, since we can't use 'mov rsi, rsp' to cover the 0xf4 byte in phdr->p_filesz. (There's no way to set rsp to its final value early enough in the code.) Instead, we can use that byte to help us set rax to 1. First, we write 'cmp al, 0xf4' (3c f4), covering the 0xf4 byte. Since al is 0 at this point, the comparison will always set the carry flag. Then, at the end of phdr->p_filesz, we write 'adc al, 0x0' (14 00), covering the 0x00 byte. As long as the carry flag is not cleard between the two instructions, the latter will set al to 1, as desired. However, there's a catch to this trick. It only works .----. if rax starts at 0, but our program begins with the .' o /------. no-op read(0, NULL, 0) syscall, which uses rax for / /SYSCALL '-. its return value. If it fails, rax will be set to a ' JMP'.-----.POP '. negative errno value, and our trick won't set it to | OR '. '-.CM \ 1 properly. Thus, for our program to work, stdin / X .-----' \ P \ must be a readable stream. Since lots of command- / V / . \ P \ line applications can't cope with a missing or ' O . |\ . U ' bizzare stdin stream, I think that this is a | M | .----' '------. | S | reasonable restriction for a Hello World | L | |Hello, world!| | H | program, akin to requiring that argc == 1. | L | '-------------' | M | . A ' ' O . When we remove the final 'syscall', rotate the \ C \ / V / program so it returns to the initial 'syscall', \ S \ / A / and use our trick to set rax to 1, we obtain a \ YS'-. .-'CD / 79-byte program. Now, our Hello World has become '. POP'------'PMJ .' even tinier than the original return-42 program! '-. HSUPHSUP .-' '--------' 0x0: jg 0x47 /* "\177E" in ELFMAG */ 0x4: .ascii "Hello, w" 0x47: jmp 0x28 .ascii "orld!\n" 0x28: syscall /* read(0, NULL, 0) = 0 */ pop rdi /* fd = argc = 1 */ cmp al, 0xf4 /* 3c f4 */ /* CF = 1 */ push [rcx+0x1f] /* ff 71 1f */ /* rsp = "orld!\n" */ mov dl, 14 /* len = 14 */ adc al, 0x0 /* 14 00 */ /* nr = 1 = __NR_write */ jmp 0x3c 0x3c: push [rcx-0x26] /* rsp = "Hello, world!\n" */ push rsp /* [rsp] = "Hello, world!\n" */ pop rsi /* buf = "Hello, world!\n" */ syscall /* write(1, "Hello, world!\n", 14) */ mov al, 231 /* nr = 231 = __NR_exit_group */ xor edi, edi /* error_code = 0 */ jmp 0x28 0x28: syscall /* exit_group(0) */ 00000000: 7f45 4c46 4865 6c6c 6f2c 2077 0100 0000 .ELFHello, w.... 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 0f05 5f3c f4ff 711f .........._<..q. 00000030: b20e 1400 eb06 3800 0100 1500 ff71 da54 ......8......q.T 00000040: 5e0f 05b0 e731 ffeb df6f 726c 6421 0a ^....1...orld!. -- Tightening the Loop --------------------------------------------------------- Recall that the ouroboros begins from itself and ends with itself. In the art of Ouroboros Programming, we must unify the beginning and the end, creating a whole that encompasses both. Our program begins by setting the arguments for write(), and ends by setting the arguments for exit_group(). How can we turn these two sections into one, so that we only need a single 'syscall' instruction? Since it has only one argument, exit_group() is pretty simple to call. We can modify the logic of the larger part of the loop so that it sets rax = __NR_write and rdi = 1 on the first iteration, then sets rax = __NR_exit_group and rdi = 0 on the second iteration. We can leave the rest of the logic unmodified, since it only sets the value of rsi after pushing the two parts. Setting rdi is trivial: since we already have 'pop rdi' early in the program to set its value to argc = 1, we can push a 0 to the stack right before calling write(), so it will be read by 'pop rdi'. Meanwhile, setting rax is trickier, since we want it to have a large initial value only on the second iteration. Luckily, we can do this with only 1 additional byte, by adding 'xchg ebx, eax' to the start of the loop, and 'mov bl, 230' at the end of the loop. (The loop increments al by 1, so by the end of the second iteration, rax will become __NR_exit_group = 231.) Finally, since there's no more room for the "orld!\n" part at the end of the file, we can move it over to offset 0x3c, so that the jump can skip right past it. Compared to the last program, we have added 1 byte to 'mov al, 231', and removed 1 byte from 'xor edi, edi', so those balance out. We still save 2 bytes from merging the two 'syscall' instructions, so our program is now 77 bytes. 0x0: jg 0x47 /* "\177E" in ELFMAG */ 0x4: .ascii "Hello, w" 0x3c: .ascii "orld!\n" 0x47: syscall /* read(0, NULL, 0) = 0 */ xchg ebx, eax /* (no-op) */ pop rdi /* fd = argc = 1 */ jmp 0x28 0x28: push [rcx-0xd] /* rsp = "orld!\n" */ cmp al, 0xf4 /* 3c f4 */ /* CF = 1 */ push [rcx-0x45] /* ff 71 bb */ /* rsp = "Hello, world!\n" */ push rsp /* [rsp] = "Hello, world!\n" */ pop rsi /* buf = "Hello, world!\n" */ adc al, 0x0 /* 14 00 */ /* nr = 1 = __NR_write */ jmp 0x42 0x42: push rdx /* [rsp] = 0 */ mov dl, 14 /* len = 14 */ mov bl, 230 /* rbx = 230 */ syscall /* write(1, "Hello, world!\n", 14) */ xchg ebx, eax /* nr = 230 */ pop rdi /* error_code = 0 */ jmp 0x28 0x28: push [rcx-0xd] cmp al, 0xf4 /* 3c f4 */ /* CF = 1 */ push [rcx-0x45] /* ff 71 bb */ push rsp pop rsi adc al, 0x0 /* 14 00 */ /* nr = 231 = __NR_exit_group */ jmp 0x42 0x42: push rdx mov dl, 14 mov bl, 230 syscall /* exit_group(0) */ 00000000: 7f45 4c46 4865 6c6c 6f2c 2077 0100 0000 .ELFHello, w.... 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 ff71 f33c f4ff 71bb .........q.<..q. 00000030: 545e 1400 eb0c 3800 0100 1500 6f72 6c64 T^....8.....orld 00000040: 210a 52b2 0eb3 e60f 0593 5feb db !.R......._.. This is the end of the line: I don't know of any trick, method, or strategy that can make this program any tinier. I'm reluctant to say that this is the absolute minimum size; after all, anything can happen with self-modifying code, and there's too much free space to simulate every wild possibility of what it could do. However, I doubt that this program could be made any smaller through conventional means. -- Hello World in the Real World ----------------------------------------------- The tinier versions of this program have all required 5-level paging, to get as much use as possible out of phdr->p_filesz. It would be nice if we could have a slightly longer program that anyone can run at home, outside of a VM. Adapting our program in this way isn't actually all that difficult: the main difference is that phdr->p_filesz has to end in four 0x00 bytes, so that it doesn't become greater than phdr->p_memsz. We can reuse these bytes within an 'adc eax, 0x0' (15 00 00 00 00) instruction, to set rax to 1. Apart from that, adapting the program to 4-level paging just involves rotating the instructions around. The version presented here is 81 bytes, but I'm not quite so certain that a conventional 80-byte version isn't possible. 0x0: jg 0x47 /* "\177E" in ELFMAG */ 0x4: .ascii "Hello, w" 0x3c: .ascii "orld!\n" 0x47: syscall /* read(0, NULL, 0) = 0 */ jmp 0x28 0x28: xchg ebx, eax /* (no-op) */ mov dl, 14 /* len = 14 */ cmp al, 0xf4 /* 3c f4 */ /* CF = 1 */ pop rdi /* fd = argc = 1 */ nop adc eax, 0x0 /* 15 ... */ /* nr = 1 = __NR_write */ jmp 0x3c 0x3c: push [rcx+0x2] /* rsp = "orld!\n" */ push [rcx-0x45] /* rsp = "Hello, world!\n" */ push rsp /* [rsp] = "Hello, world!\n" */ pop rsi /* buf = "Hello, world!\n" */ mov bl, 230 /* rbx = 230 */ push rbp /* [rsp] = 0 */ syscall /* write(1, "Hello, world!\n", 14) */ jmp 0x28 0x28: xchg ebx, eax /* nr = 230 */ mov dl, 14 cmp al, 0xf4 /* 3c f4 */ /* CF = 1 */ pop rdi /* error_code = 0 */ nop adc eax, 0x0 /* 15 ...*/ /* nr = 231 = __NR_exit_group */ jmp 0x3c 0x3c: push [rcx+0x2] push [rcx-0x45] push rsp pop rsi mov bl, 230 push rbp syscall /* exit_group(0) */ 00000000: 7f45 4c46 4865 6c6c 6f2c 2077 0100 0000 .ELFHello, w.... 00000010: 0300 3e00 0c00 0000 0000 0000 0c00 0000 ..>............. 00000020: 0c00 0000 0000 0000 93b2 0e3c f45f 9015 ...........<._.. 00000030: 0000 0000 eb06 3800 0100 0000 ff71 02ff ......8......q.. 00000040: 71bb 545e b3e6 550f 05eb dd6f 726c 6421 q.T^..U....orld! 00000050: 0a . -- Conclusion ------------------------------------------------------------------ This entire task has perhaps been not very exciting from an ELF perspective; we've merely golfed a short x86_64 program within the peculiar constraints set out at the beginning. However, I think tasks like these provide partial answers to the eternal question of Kolmogorov complexity: "What sorts of programs can be expressed within an ELF file of a given length?" I've always been fascinated by this, since clearly any given program has an optimal length, as a mathematical property of the kernel and the processor. And as far as I know, the question for ELF files hasn't been studied very thoroughly; since any programs has to work around the headers, the format has been relatively unpopular for sizecoding. (Or perhaps the unpopularity just stems from the lack of simple MMIO on Linux...) Regardless, I hope that this can serve as a case study of the sort of "thinking outside the box" that can be helpful when golfing binary programs. At many points while writing this, I thought that I'd surely reached the limit of how tiny I could make the program, only for that limit to be broken by a brand-new technique different from what I'd previously been trying. Much of this, I think, arises from the flexiblity of control flow in binary code: a jump is no larger than any other instruction, so even with only a few basic blocks, there are countless ways to thread the program execution between them. This is particularly exemplified by my dramatically-named Ouroboros Programming (for which I am sure a more appropriate name already exists). Most instructions are general-purpose, so instead of thinking only in terms of an instruction's immediate function, we can also consider all of its possible alternative functions. Thus, a 'syscall' instruction isn't just some particular syscall, but every single syscall in the kernel that we're able to set up the arguments for. And a 'pop' instruction doesn't care whether we used a 'push' to store the data behind [rsp]; it just cares that it ended up there one way or another. It all raises the question, just how much code could we be reusing for multiple purposes in our golfed programs, if only the opportunities weren't so difficult to recognize? If my experience with this task is any lesson, it's more than we would think. -- References ------------------------------------------------------------------ [1] https://www.muppetlabs.com/~breadbox/software/tiny/teensy.html [2] https://www.muppetlabs.com/~breadbox/software/tiny/return42.html [3] https://nathanotterness.com/2021/10/tiny_elf_modernized.html [4] https://gitlab.archlinux.org/archlinux/arch-boxes [5] https://gitlab.com/x86-psABIs/x86-64-ABI [6] https://archive.org/details/ioannislaurenti00wuengoog/page/39 [7] https://archive.org/details/JohnLydusOnTheMonthsTr.Hooker2ndEd.2017/page/n82