┌───────────────────────┐
                                                ▄▄▄▄▄ ▄▄▄▄▄ ▄▄▄▄▄       │
                                                │ █   █ █ █ █   █       │
                                                │ █   █ █ █ █▀▀▀▀       │
                                                │ █   █   █ █     ▄     │
                                                │                 ▄▄▄▄▄ │
                                                │                 █   █ │
                                                │                 █   █ │
                                                │                 █▄▄▄█ │
                                                │                 ▄   ▄ │
                                                │                 █   █ │
                                                │                 █   █ │
                                                │                 █▄▄▄█ │
                                                │                 ▄▄▄▄▄ │
                                                │                   █   │
ARM32 ELF Sizecoding                            │                   █   │
~ Deater                                        └───────────────────█ ──┘

--- 1) Background

In the demoscene there are various size-coding competitions where
the goal is to write the most interesting program in a certain size
range (often a power of two, like 256 bytes or 4k).  Usually the rules
count the executable header against the file size.  This isn't much
of a problem for most 8-bit systems, and it is definitely not a problem
for the most infamous platform (MS-DOS .COM files).  It is, however,
an extreme headache for people trying to write tiny Linux demos.  Often
a significant part of the file size ends up being the ELF header.

There has been a lot of work figuring out ways to abuse the ELF header
to allow smaller programs, however this work often concentrates on the
x86 architecture.  We were interested in seeing the limits on a 32-bit
ARM system, specifically a Raspberry Pi running Raspbian Linux.  On
this architecture with traditional tools a gcc-compiled executable
has 340 bytes of headers tacked to the front.  By using hand-built
techniques you can craft a custom statically-linked executable with
only 84 bytes of ELF header.  This is much better, and may be suitable
for a 256 byte demo, but what if we want things smaller?  Can we create
a 128 byte demo?

--- 2) The Challenge

An ARM32 executable with an 84 byte header leaves only 44 bytes free.
The default ARM32 RISC machine code has 32-bit instructions, which means
any 128 byte program would only have room for eleven instructions.
This isn't much, especially when you're running Linux and just setting
up a proper exit() system call takes 3 instructions.

Things can be mildly helped by using one of the ARM embedded instruction
encodings that provide improved code density.  We use THUMB-2 which
allows many common instruction patterns to fit in half the size
(16-bits).  You do need a new enough CPU to support this; the original
Rapberry Pi Model 1 machines with ARM 1176 processors can't do this, but
more recent models did add support for THUMB-2.

--- 3) The Demo

This project evolved into a demo called "Thumbpinski" which finished
in 9th place in the combined 128 byte demo category at the Lovebyte 2022
size-coding demoparty.

The demo doesn't have room to do much.  It draws a purple 
"Sierpinski Triangle" to the screen.  This is a much overdone effect
in sizecoding because you can generate it with a single bitwise AND
of the X and Y co-ordinates.  The graphics are actually just text made
by generating ANSI escape codes and printing to Linux stdout via
syscalls.  In addition a call to the nanosleep() syscall ensures
a consistent frame rate.

--- 4) Summary of the tricks

+ Placing code/data in the ELF Header

Our biggest trick is that even though the ELF header is still 84 bytes,
we manage to stash some code and data in the header itself at unimportant
or unused offsets.

We change things so execution is launched 8 bytes into the header, 
where 8 bytes of extended ABI/padding lives.  We fit in a few instructions
here with the last one being a branch to the main executable at the
normal location.

We also store some data in the e_shoff/e_flags area.  Linux doesn't
seem to complain as long as the first byte of e_flags is 0, so we
crafted our data so that location starts at 0 but we set it to a proper
value later (it will eventually hold ASCII value '0' or '5' depending on
if the current text being drawn is black or purple).

+ Overlapping the end of the header with code

Linux seems to ignore the p_align field which is the last 4 bytes
of the header, so we start the main code block overlapped with
this area.

+ Loading at a non-standard location

To save some space we force the executable to load at address 0x8000
instead of the more typical 0x10000.  By having the address fit in a
single 16 bit offset it allows smaller constants (and thus smaller
instructions) when working on pointers.

+ Optimizing the Code

The rest of the challenge was THUMB2 size optimization at the
assembly language level and so not really ELF related.  There's nothing
extra clever here, just careful use of short constants, making sure
we have as many 16-bit instructions as possible, using a few
THUMB-2 specific instructions, and not having an exit() syscall as
the program is an infinite loop.

--- 5) Conclusion

With our changes we managed to make an ARM32 Linux demo that fits in
exactly 128 bytes.  The executable runs, but readelf doesn't like it and
gdb refuses to work on it.  We've tested on many systems and it still runs as
of the Linux 6.1.21 kernel that ships with Raspbian in June of 2023.
Our new challenge is to use this knowledge to make a more exciting demo
for next time.

----------- the source code ------------

@ Build instructions:
@ as -mthumb-interwork -o thumbpinski.o thumbpinski.s
@ ld -N --thumb-entry=_start -Ttext=0x8000 -o thumbpinski thumbpinski.o
@ objcopy -O binary thumbpinski thumbpinski_small

.syntax unified
.thumb

@ Width of your terminal
XWIDTH = 79

@ ELF header

@ Normal load address is 0x00010054 / offset 0x54 (84)
@    we load at 0x8000 instead as the smaller constant fits in fewer bytes
@    we start executing at offset 8, which is inside the header

LOAD_OFFSET = 0x8
LOAD_ADDRESS = 0x00008000+LOAD_OFFSET

@ $00: magic number
.byte 0x7f,'E','L','F'

@ $04: class (32 bit)
.byte 0x01

@ $05: endianess (little)
.byte 0x01

@ $06: ELF version
.byte 0x01

@ $07: OS ABI (System V)
.byte 0x00

real_start:

@ $08: extended ABI + 7 bytes padding
@     Linux ignores this in statically linked apps?
@.byte 0x00,0x00,0x00,0x00, 0x00,0x00,0x00,0x00
    movw    r6,#:lower16:data_begin     @ 4 bytes
    movs    r3,#2               @ 2 bytes
    b.n _start              @ 2 bytes

@ $10: ELF file type (2==executable)
.byte 0x02,0x00

@ $12: target architecture (0x0028 == ARM32)
.byte 0x28,0x00

@ $14: version
.byte 0x01,0x00,0x00,0x00

@ $18: entry point in executable (+1 because we're thumb)
@.byte 0x55,0x00,0x01,0x00
.int LOAD_ADDRESS+1

@ $1C: e_phoff -- start of header following this one
.byte 0x34,0x00,0x00,0x00

@ $20: e_shoff -- start of section header table  (not needed as we have none)
@.byte 0x00,0x00,0x00,0x00
@ $24: e_flags -- flags [Version5 EABI] [soft-float ABI] (ignored by Linux?)
@ No, the very first (high) byte has rules?  0 is OK, but 'm' isn't?

@ we store some data here

.byte 0         @ pad to 8 bytes, also be sure 0 in the string
            @ in right place
data_begin:

color_string:
    .byte 27,'[','4',0,'m',' '  @   .ascii  "\033[40m "
linefeed:
    .ascii  "\n"

@ $28: e_ehsize -- Size of this header
.byte 0x34,0x00

@ $2A: e_phentsize -- size of program header entry
.byte 0x20,0x00

@ $2C: e_phnum -- number of entries in header table
.byte 0x01,0x00

time:
@ $2E: e_shentsize -- size of section header entry
.byte 0x00,0x00

@ $30: e_shnum -- number of entries in section header
.byte 0x00,0x00

@ $32: e_shstrndx -- index of section header
.byte 0x00,0x00

@@@@@@@@@@@@@@@@@@@@ END OF ELF HEADER

@@@@@@@@@@@@@@@@@@@@ Program header

@ $00: p_type -- 0x1 == PT_LOAD (loadable segment)
.byte 0x01,0x00,0x00,0x00

@ $04: p_offset -- offset of segment file in image
.byte LOAD_OFFSET,0x00,0x00,0x00

@ $08: p_vaddr -- virtual address of segment in image
.int LOAD_ADDRESS

@ $0c: p_paddr -- physical address
.int LOAD_ADDRESS

@ $10: p_filesz -- size of segment
.byte real_end-real_start,0x00,0x00,0x00

@ $14: p_memsz -- size of segment in memory
.byte real_end-real_start,0x00,0x00,0x00

@ $18: p_flags -- segment dependent flags
.byte 0x07,0x00,0x00,0x00

@ $1C: p_align -- alignment (Linux ignores this?)
@.byte 0x04,0x00,0x00,0x00

@@@@@@@@@@@@@@@@@@@@ end of program header

@ Thumb2 sierpinski

@ Syscalls
.equ SYSCALL_EXIT,  1
.equ SYSCALL_WRITE, 4
.equ SYSCALL_NANOSLEEP, 162

.equ STDIN,0
.equ STDOUT,1
.equ STDERR,2

        .globl _start
_start:
    @ code in header runs some setup before branching here
    @ it sets R3 to 2 and sets R6 to point to data area

    @ set delay for nanosleep to be 0x200.0000 nanoseconds (roughly 33ms)
    strb    r3,[r6,((time-data_begin)+7)]

forever_loop:
    movs    r3,#XWIDTH      @ initialize X
xloop:
    ands    r5,r3,r4        @ X AND Y = the sierpinski magic

    ite eq          @ Thumb2 condition code hackery
    moveq   r5,#'5'         @ purple
    movne   r5,#'0'         @ black

    movs    r1,r6           @ point r1 to color string (in r6)
    strb    r5,[r6,3]       @ patch string with 2/0
    movs    r2,#6           @ length goes in r2

    cbnz    r3,skip_lf      @ weird THUMB2 conditional branch
    adds    r2,r2,#1        @ tack linefeed on end
skip_lf:

    @================================
    @ write stdout
    @================================
    @ string in r1
    @ len in r2
write_stdout:
    movs    r0,#STDOUT
    movs    r7,#SYSCALL_WRITE
    swi #0

    subs    r3,r3,#1            @ decrement X
    bpl xloop

    adds    r4,r4,#1            @ increment Y

    @ sleep a bit to roughly force a constant framerate

    adds    r0,r6,#(time-data_begin)    @ time pointer in r0
    movs    r1,0                @ NULL
    movs    r7,#SYSCALL_NANOSLEEP
    swi #0

    b   forever_loop

real_end:

----------- end code ------------