vllm.v1.attention.ops.flashmla ¶
_is_flashmla_available ¶
Source code in vllm/v1/attention/ops/flashmla.py
_raise_flashmla_unavailable ¶
flash_mla_with_kvcache_fp8 ¶
flash_mla_with_kvcache_fp8(
    q: Tensor,
    k_cache: Tensor,
    block_table: Tensor,
    cache_seqlens: Tensor,
    head_dim_v: int,
    tile_scheduler_metadata: Tensor,
    num_splits: Tensor,
    softmax_scale: float | None = None,
    causal: bool = False,
    descale_q: Tensor | None = None,
    descale_k: Tensor | None = None,
) -> tuple[Tensor, Tensor]
Source code in vllm/v1/attention/ops/flashmla.py
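A minimal usage sketch for this FP8 dense decode path. Only the parameter names come from the signature above; the tensor shapes, dtypes, head dimensions (576/512), block size, and the per-batch descale layout are assumptions based on a typical MLA decode setup, not taken from this page.

```python
import torch

from vllm.v1.attention.ops.flashmla import (
    flash_mla_with_kvcache_fp8,
    get_mla_metadata_dense_fp8,
)

# Assumed decode-style shapes (illustrative only):
#   q:             [batch, seq_len_q, num_heads_q, head_dim], FP8 (assumed)
#   k_cache:       [num_blocks, block_size, num_heads_k, head_dim], FP8 (assumed)
#   block_table:   [batch, max_blocks_per_seq], int32
#   cache_seqlens: [batch], int32
batch, seq_len_q, num_heads_q, num_heads_k = 2, 1, 128, 1
head_dim, head_dim_v, block_size, num_blocks = 576, 512, 64, 128

q = torch.randn(batch, seq_len_q, num_heads_q, head_dim,
                device="cuda").to(torch.float8_e4m3fn)
k_cache = torch.randn(num_blocks, block_size, num_heads_k, head_dim,
                      device="cuda").to(torch.float8_e4m3fn)
cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device="cuda")
blocks_per_seq = 1024 // block_size
block_table = torch.arange(batch * blocks_per_seq, dtype=torch.int32,
                           device="cuda").reshape(batch, blocks_per_seq)

# Scheduler metadata for the dense FP8 path (see get_mla_metadata_dense_fp8 below).
tile_scheduler_metadata, num_splits = get_mla_metadata_dense_fp8(
    cache_seqlens, seq_len_q * num_heads_q // num_heads_k, num_heads_k
)

out, softmax_lse = flash_mla_with_kvcache_fp8(
    q,
    k_cache,
    block_table,
    cache_seqlens,
    head_dim_v,
    tile_scheduler_metadata,
    num_splits,
    softmax_scale=None,  # assumed to default to head_dim ** -0.5 inside the kernel
    causal=True,
    # Assumed layout: one float32 dequantization scale per batch element.
    descale_q=torch.ones(batch, dtype=torch.float32, device="cuda"),
    descale_k=torch.ones(batch, dtype=torch.float32, device="cuda"),
)
# out: attention output with value head dimension head_dim_v;
# softmax_lse: per-head log-sum-exp values (assumed semantics).
```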
get_mla_metadata_dense_fp8 ¶
get_mla_metadata_dense_fp8(
    cache_seqlens: Tensor,
    num_q_tokens_per_head_k: int,
    num_heads_k: int,
) -> tuple[Tensor, Tensor]
Source code in vllm/v1/attention/ops/flashmla.py
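A hedged sketch of producing the scheduler metadata on its own. Interpreting num_q_tokens_per_head_k as seq_len_q * num_heads_q // num_heads_k follows the usual FlashMLA convention and is an assumption here; only the parameter names come from the signature above.

```python
import torch

from vllm.v1.attention.ops.flashmla import get_mla_metadata_dense_fp8

# Per-sequence KV-cache lengths for a small decode batch (illustrative values).
cache_seqlens = torch.tensor([1024, 512, 2048], dtype=torch.int32, device="cuda")

# Assumption: with MLA decode, all query heads share one KV head, so the
# q-tokens-per-KV-head count is seq_len_q * num_heads_q // num_heads_k.
seq_len_q, num_heads_q, num_heads_k = 1, 128, 1
num_q_tokens_per_head_k = seq_len_q * num_heads_q // num_heads_k

tile_scheduler_metadata, num_splits = get_mla_metadata_dense_fp8(
    cache_seqlens, num_q_tokens_per_head_k, num_heads_k
)
# Both returned tensors are forwarded unchanged to flash_mla_with_kvcache_fp8.
```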
is_flashmla_dense_supported ¶
Returns a tuple of (is_supported_flag, unsupported_reason); the reason is set only when unsupported.
Source code in vllm/v1/attention/ops/flashmla.py
is_flashmla_sparse_supported ¶
Returns a tuple of (is_supported_flag, unsupported_reason); the reason is set only when unsupported.
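A sketch of gating on these two capability probes; the fallback handling shown here is illustrative and not vLLM's actual backend-selection logic.

```python
from vllm.v1.attention.ops.flashmla import (
    is_flashmla_dense_supported,
    is_flashmla_sparse_supported,
)

dense_ok, dense_reason = is_flashmla_dense_supported()
sparse_ok, sparse_reason = is_flashmla_sparse_supported()

if not dense_ok:
    # Fall back to another attention backend; the reason string explains
    # why FlashMLA dense cannot be used on this platform.
    print(f"FlashMLA dense unsupported: {dense_reason}")
if not sparse_ok:
    print(f"FlashMLA sparse unsupported: {sparse_reason}")
```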