feat(kernel): implement PagedFieldprintAttention triton kernel and benchmark

Antigravity Agent
2026-05-25 12:17:16 +00:00
parent 859a3cf5f1
commit 02953c7201
3 changed files with 218 additions and 0 deletions
@@ -0,0 +1,24 @@
# PagedFieldprintAttention Kernel Benchmark
This directory contains the Triton kernel implementation and benchmark suite for the `PagedFieldprintAttention` mechanism proposed in the Verifiable Dual-Path Architecture.
## Architecture
Modern autoregressive generation relies heavily on fused attention kernels (such as FlashAttention) to avoid redundant round trips to HBM. Adding cryptographic identity anchors as separate, unfused operations forces intermediate matrices to be written back to HBM, which sharply reduces inference throughput.
Our custom Triton kernel, `paged_fieldprint_attention_kernel`, resolves this in two phases: phase 1 computes the attention scores for the cryptographic anchor tokens, and phase 2 computes them for the standard context tokens. The scaling and accumulation of the softmax reduction stay entirely in SRAM across both phases.
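The two-phase accumulation can be illustrated with a plain-Python sketch of the online-softmax recurrence for a single query vector. This is not the Triton kernel itself: the function names (`online_softmax_update`, `two_phase_attention`, `reference_attention`) are hypothetical, and the sketch assumes the anchor and context tokens share one softmax normalization, which the running `(max, denominator, accumulator)` state makes possible without materializing the full score vector.

```python
import math


def online_softmax_update(state, q, keys, values, scale):
    """Fold one block of keys/values into the running (max, denom, acc) state."""
    m, l, acc = state
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))  # scaled dot-product score
        m_new = max(m, s)
        alpha = math.exp(m - m_new)  # rescales everything accumulated so far
        p = math.exp(s - m_new)      # unnormalized weight of the current token
        l = l * alpha + p
        acc = [a * alpha + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return m, l, acc


def two_phase_attention(q, anchor_k, anchor_v, ctx_k, ctx_v):
    """Hypothetical two-phase layout: anchor tokens first, then context tokens."""
    scale = 1.0 / math.sqrt(len(q))
    state = (float("-inf"), 0.0, [0.0] * len(anchor_v[0]))
    state = online_softmax_update(state, q, anchor_k, anchor_v, scale)  # phase 1
    state = online_softmax_update(state, q, ctx_k, ctx_v, scale)        # phase 2
    _, l, acc = state
    return [a / l for a in acc]  # single normalization at the very end


def reference_attention(q, keys, values):
    """Unfused reference: materialize all scores, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
```

Because the running maximum and denominator are carried from phase 1 into phase 2, the result matches a single softmax over the concatenated anchor and context tokens, up to floating-point rounding.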
## Files
- `fused_attention.py`: The Triton kernel implementation.
- `benchmark.py`: The `triton.testing.perf_report` harness comparing the naive PyTorch implementation against our fused Triton kernel.
## Execution
This benchmark requires a CUDA-enabled NVIDIA GPU.
```bash
pip install torch triton
python benchmark.py
```
The script sweeps context lengths from 1,024 to 32,768 and generates `attention-latency-benchmark.csv` and a PNG plot comparing the $O(N)$ and $O(N^2)$ memory-bandwidth costs.
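To give a sense of scale for the two growth rates, the following back-of-the-envelope sketch estimates per-head tensor sizes across the sweep. It is not part of `benchmark.py`. It assumes powers-of-two context lengths, fp16 elements, a head dimension of 128, and a prefill-style workload with N queries against N keys; all helper names are hypothetical. The $O(N^2)$ term is the materialized score matrix of an unfused implementation, and the $O(N)$ term is an idealized lower bound covering only the Q, K, V, and output tensors.

```python
def sweep_lengths(start=1024, stop=32768):
    """Assumed sweep: powers of two from start to stop, inclusive."""
    lengths = []
    n = start
    while n <= stop:
        lengths.append(n)
        n *= 2
    return lengths


def score_matrix_bytes(n_ctx, bytes_per_elem=2):
    """Size of the full n_ctx x n_ctx score matrix for one head (fp16 assumed)."""
    return n_ctx * n_ctx * bytes_per_elem


def qkvo_bytes(n_ctx, head_dim=128, bytes_per_elem=2):
    """Size of Q, K, V and the output: four n_ctx x head_dim tensors for one head."""
    return 4 * n_ctx * head_dim * bytes_per_elem


if __name__ == "__main__":
    mib = 1024 * 1024
    print(f"{'N':>6} {'scores (MiB)':>14} {'Q,K,V,O (MiB)':>14}")
    for n in sweep_lengths():
        print(f"{n:>6} {score_matrix_bytes(n) / mib:>14.1f} {qkvo_bytes(n) / mib:>14.1f}")
```

Under these assumptions, a 32,768-token context gives a 2 GiB score matrix per head, while Q, K, V, and the output together occupy 32 MiB.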