vllm.model_executor.layers.quantization.kernels.scaled_mm
Modules:
| Name | Description |
|---|---|
| ScaledMMLinearKernel | |
| aiter | |
| cpu | |
| cutlass | |
| flashinfer | |
| pytorch | |
| rocm | |
| triton | |
_KernelConfigT module-attribute
_KernelConfigT = TypeVar(
"_KernelConfigT", bound=ScaledMMLinearLayerConfig
)
_POSSIBLE_FP8_KERNELS module-attribute
_POSSIBLE_FP8_KERNELS: dict[
PlatformEnum, list[type[FP8ScaledMMLinearKernel]]
] = {
CUDA: [
FlashInferFP8ScaledMMLinearKernel,
CutlassFP8ScaledMMLinearKernel,
PerTensorTorchFP8ScaledMMLinearKernel,
ChannelWiseTorchFP8ScaledMMLinearKernel,
],
ROCM: [
ROCmFP8ScaledMMLinearKernel,
PerTensorTorchFP8ScaledMMLinearKernel,
RowWiseTorchFP8ScaledMMLinearKernel,
ChannelWiseTorchFP8ScaledMMLinearKernel,
],
CPU: [
PerTensorTorchFP8ScaledMMLinearKernel,
ChannelWiseTorchFP8ScaledMMLinearKernel,
],
}
_POSSIBLE_INT8_KERNELS module-attribute
_POSSIBLE_INT8_KERNELS: dict[
PlatformEnum, list[type[Int8ScaledMMLinearKernel]]
] = {
CPU: [CPUInt8ScaledMMLinearKernel],
CUDA: [
CutlassInt8ScaledMMLinearKernel,
TritonInt8ScaledMMLinearKernel,
],
ROCM: [
AiterInt8ScaledMMLinearKernel,
TritonInt8ScaledMMLinearKernel,
],
}
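The per-platform lists above appear to be ordered by preference. The sketch below illustrates the presumed selection semantics using is_supported_and_can_implement_kernel (documented at the end of this page); it is an illustration of the contract described by choose_scaled_mm_linear_kernel, not the actual source.

```python
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    is_supported_and_can_implement_kernel,
)

def _choose_sketch(config, kernels_for_platform, compute_capability=None):
    # Scan the platform's priority-ordered list and return the first kernel
    # that is supported and can implement the config.
    for kernel in kernels_for_platform:
        ok, _reason = is_supported_and_can_implement_kernel(
            kernel, config, compute_capability
        )
        if ok:
            return kernel
    # Mirrors the documented ValueError of choose_scaled_mm_linear_kernel.
    raise ValueError("No kernel can implement the given config")
```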
choose_scaled_mm_linear_kernel
choose_scaled_mm_linear_kernel(
config: _KernelConfigT,
possible_kernels: dict[
PlatformEnum, list[type[_KernelT]]
],
compute_capability: int | None = None,
force_kernel: type[_KernelT] | None = None,
) -> type[_KernelT]
Choose a _KernelT that can implement the given config for the given compute capability. Attempts to choose the best kernel in terms of performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | _KernelConfigT | Description of the linear layer to be implemented. | required |
| possible_kernels | dict[PlatformEnum, list[type[_KernelT]]] | A dictionary mapping platforms to their lists of possible kernels. | required |
| compute_capability | Optional[int] | The compute capability of the target device; if None, the current device's compute capability is used. | None |
| force_kernel | Optional[type[_KernelT]] | An optional kernel that overrides possible_kernels if it can implement the config. If None, only the possible kernels are tried. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If no kernel can implement the given config. |
Returns:
| Name | Type | Description |
|---|---|---|
| _KernelT | type[_KernelT] | Chosen kernel. |
Source code in vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py
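A minimal usage sketch, picking an INT8 kernel for the current platform. The config construction is elided because the ScaledMMLinearLayerConfig fields are not documented in this section.

```python
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    _POSSIBLE_INT8_KERNELS,
    choose_scaled_mm_linear_kernel,
)

config = ...  # a ScaledMMLinearLayerConfig describing the layer (fields elided)

kernel_cls = choose_scaled_mm_linear_kernel(
    config=config,
    possible_kernels=_POSSIBLE_INT8_KERNELS,
    compute_capability=None,  # None -> use the current device's capability
)
print(kernel_cls.__name__)  # e.g. CutlassInt8ScaledMMLinearKernel on CUDA
```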
init_fp8_linear_kernel
init_fp8_linear_kernel(
activation_quant_key: QuantKey,
weight_quant_key: QuantKey,
out_dtype: dtype,
force_kernel: type[FP8ScaledMMLinearKernel]
| None = None,
module_name: str | None = None,
) -> FP8ScaledMMLinearKernel
Source code in vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py
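This initializer has no docstring here; judging from the signature, it turns an activation/weight QuantKey pair into a ready FP8 kernel. A hedged sketch, with the QuantKey values left as placeholders (they are constructed elsewhere in vLLM) and an illustrative module_name:

```python
import torch
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    init_fp8_linear_kernel,
)

act_key = ...     # QuantKey describing activation quantization (built elsewhere)
weight_key = ...  # QuantKey describing weight quantization (built elsewhere)

kernel = init_fp8_linear_kernel(
    activation_quant_key=act_key,
    weight_quant_key=weight_key,
    out_dtype=torch.bfloat16,
    module_name="model.layers.0.mlp.down_proj",  # illustrative name
)
```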
init_int8_linear_kernel
init_int8_linear_kernel(
is_channelwise: bool,
is_static_input_scheme: bool,
input_symmetric: bool,
module_name: str,
) -> Int8ScaledMMLinearKernel
Source code in vllm/model_executor/layers/quantization/kernels/scaled_mm/__init__.py
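An illustrative call for a static, symmetric, per-tensor INT8 scheme; the comments interpret the flags from their names:

```python
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    init_int8_linear_kernel,
)

kernel = init_int8_linear_kernel(
    is_channelwise=False,          # per-tensor rather than per-channel weight scales
    is_static_input_scheme=True,   # activation scale fixed ahead of time
    input_symmetric=True,          # symmetric activation quantization (no zero point)
    module_name="model.layers.0.self_attn.qkv_proj",  # illustrative name
)
```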
is_supported_and_can_implement_kernel
is_supported_and_can_implement_kernel(
kernel: type[_KernelT],
config: _KernelConfigT,
compute_capability: int | None,
) -> tuple[bool, str]
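No description accompanies this function, but the signature suggests it reports whether kernel is supported on the current platform and can implement config, returning a reason string alongside the verdict. A hedged sketch; the import path for the kernel class is assumed from the cutlass submodule listed above:

```python
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    is_supported_and_can_implement_kernel,
)
from vllm.model_executor.layers.quantization.kernels.scaled_mm.cutlass import (
    CutlassInt8ScaledMMLinearKernel,  # assumed to live in the cutlass submodule
)

config = ...  # a ScaledMMLinearLayerConfig (construction elided)

ok, reason = is_supported_and_can_implement_kernel(
    kernel=CutlassInt8ScaledMMLinearKernel,
    config=config,
    compute_capability=90,  # e.g. SM 9.0 (Hopper); None is also accepted
)
if not ok:
    print(f"kernel rejected: {reason}")
```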