Acceleration Methods
====================

LOOM-Eval provides cutting-edge acceleration techniques optimized for long-context processing. All methods support multi-GPU inference and are compatible with NVIDIA 3090, A-series, and H-series GPUs.

KV Cache Optimization
---------------------

.. list-table::
   :header-rows: 1
   :widths: 25 50 25
   
   * - Method
     - Description
     - Source
   * - H2O
     - Attention-based KV cache selection
     - `GitHub <https://github.com/FMInference/H2O>`__
   * - StreamingLLM
     - Retains initial tokens in cache
     - `GitHub <https://github.com/mit-han-lab/streaming-llm>`__
   * - SnapKV
     - Attention pooling before selection
     - `GitHub <https://github.com/FasterDecoding/SnapKV>`__
   * - L2Compress
     - Uses L2 norm for cache compression
     - `GitHub <https://github.com/alessiodevoto/l2compress>`__
   * - PyramidKV
     - Layer-wise budget allocation
     - `GitHub <https://github.com/Zefan-Cai/KVCache-Factory>`__
   * - CakeKV
     - Layer-specific preference scoring
     - `GitHub <https://github.com/antgroup/cakekv>`__
   * - KIVI
     - 2bit asymmetric quantization
     - `GitHub <https://github.com/jy-yuan/KIVI>`__
   * - ThinK
     - Query-driven key cache pruning
     - `GitHub <https://github.com/SalesforceAIResearch/ThinK>`__
   * - CaM
     - Cache merging for memory efficiency
     - `GitHub <https://github.com/zyxxmu/cam>`__

Sparse Attention
----------------

.. list-table::
   :header-rows: 1
   :widths: 25 50 25
   
   * - Method
     - Description
     - Source
   * - FlexPrefill
     - Dynamic context-aware sparse attention
     - `GitHub <https://github.com/bytedance/FlexPrefill>`__
   * - XAttention
     - Strided antidiagonal scoring (A100-optimized)
     - `GitHub <https://github.com/mit-han-lab/x-attention>`__

Usage
-----
Enable acceleration via CLI::

    loom-eval.run \
        --model_path Meta-Llama-3.1-8B-Instruct \
        --cfg_path ./benchmarks/General/RULER.yaml \
        --acceleration SnapKV \
        --gpu_ids 0 1 \
        --gpu_allocation 2 \
        --eval \

Performance (128K Context)
--------------------------
Results for Llama-3.1-8B-Instruct on RULER benchmark (40GB A100):

.. list-table::
   :header-rows: 1
   :widths: 30 20 20
   
   * - Method
     - Score
     - Time
   * - KIVI
     - 86.48
     - 0:32:57
   * - H2O
     - 22.98
     - 0:27:00
   * - StreamingLLM
     - 15.02
     - 0:26:35
   * - L2Compress
     - 23.16
     - 0:27:26
   * - CaM
     - 46.68
     - -
   * - CakeKV
     - 74.47
     - 0:28:26
   * - PyramidKV
     - 64.28
     - 0:26:39
   * - FlexPrefill
     - 76.50
     - 0:15:58
   * - SnapKV
     - 64.72
     - 0:26:44
   * - ThinK
     - 66.45
     - 0:35:46
   * - Transformers
     - 75.21
     - 3:23:10
   * - vLLM
     - 76.11
     - 0:27:20
   * - SGLang
     - 82.77
     - 0:20:27
   * - XAttention
     - 78.58
     - 0:16:34

Model Compatibility
-------------------
+---------------------+-----------+--------+--------+--------+
| Method              | llama     | mistral| glm    | qwen2  |
+=====================+===========+========+========+========+
| **KV Optimization** | ✓         | ✓      | ✗      | ✗      |
+---------------------+-----------+--------+--------+--------+
| **FlexPrefill**     | ✓         | ✗      | ✓      | ✓      |
+---------------------+-----------+--------+--------+--------+
| **CakeKV**          | ✓         | ✓      | ✗      | ✓      |
+---------------------+-----------+--------+--------+--------+
| **XAttention**      | ✓         | ✗      | ✗      | ✗      |
+---------------------+-----------+--------+--------+--------+

Installation Notes
------------------
**KIVI**::
    cd KIVI/quant && pip install -e .

**ThinK**::
    cd think/load_tiny_cuda_resource && make i

**FlexPrefill**::
    cd flex_prefil && pip install -e .

**XAttention**::
    cd x-attention && pip install -e .

Hardware Requirements
---------------------
- 128K context support on single 40GB A100 (all methods except L2Compress)
- Multi-GPU recommended for 24GB cards
- XAttention only support A100 optimization