vllm.model_executor.parameter
__all__ module-attribute ¶
__all__ = [
"BasevLLMParameter",
"PackedvLLMParameter",
"PerTensorScaleParameter",
"ModelWeightParameter",
"ChannelQuantScaleParameter",
"GroupQuantScaleParameter",
"PackedColumnParameter",
"RowvLLMParameter",
]
BasevLLMParameter ¶
Bases: Parameter
Base parameter for vLLM linear layers. Extends torch.nn.Parameter by taking in a linear weight loader. The loaded weight is copied into the parameter when the provided weight loader is called.
Source code in vllm/model_executor/parameter.py
__init__ ¶
Initialize the BasevLLMParameter
:param data: torch tensor with the parameter data
:param weight_loader: weight loader callable
:returns: a torch.nn.Parameter
Source code in vllm/model_executor/parameter.py
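A minimal sketch of constructing a BasevLLMParameter directly. In practice these parameters are created by quantization methods inside linear layers; simple_weight_loader below is a hypothetical stand-in for the loader a layer would normally provide, and the last line assumes the weight_loader property exposes the callable passed at construction.

import torch
from vllm.model_executor.parameter import BasevLLMParameter

def simple_weight_loader(param: torch.nn.Parameter,
                         loaded_weight: torch.Tensor) -> None:
    # Stand-in loader: copy the checkpoint tensor into the parameter.
    param.data.copy_(loaded_weight)

param = BasevLLMParameter(
    data=torch.empty(4096, 4096, dtype=torch.bfloat16),
    weight_loader=simple_weight_loader,
)

# During checkpoint loading, the stored loader is invoked with the parameter
# and the tensor read from disk.
param.weight_loader(param, torch.randn(4096, 4096, dtype=torch.bfloat16))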
__new__ ¶
_shard_id_as_int ¶
Source code in vllm/model_executor/parameter.py
BlockQuantScaleParameter ¶
Bases: _ColumnvLLMParameter, RowvLLMParameter
Parameter class for weight scales loaded for weights with block-wise quantization. Uses both column and row parallelism.
Source code in vllm/model_executor/parameter.py
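A hedged sketch of how a block-quantized scheme might allocate its scales with one scale per weight block. The partition sizes, block shape, and weight_loader are placeholders, not values taken from any particular vLLM config.

import torch
from vllm.model_executor.parameter import BlockQuantScaleParameter

output_size_per_partition, input_size_per_partition = 4096, 4096  # placeholders
block_n, block_k = 128, 128  # placeholder block shape

def weight_loader(param, loaded_weight, *args, **kwargs):
    param.data.copy_(loaded_weight)  # stand-in loader

# One scale per (block_n x block_k) block of the weight matrix.
weight_scale = BlockQuantScaleParameter(
    data=torch.empty(output_size_per_partition // block_n,
                     input_size_per_partition // block_k,
                     dtype=torch.float32),
    input_dim=1,
    output_dim=0,
    weight_loader=weight_loader,
)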
ChannelQuantScaleParameter ¶
Bases: _ColumnvLLMParameter
Parameter class for weight scales loaded for weights with channel-wise quantization. Equivalent to _ColumnvLLMParameter.
Source code in vllm/model_executor/parameter.py
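A short sketch of allocating channel-wise scales: one scale per output channel, so only output_dim is needed. The partition size and weight_loader are placeholders.

import torch
from vllm.model_executor.parameter import ChannelQuantScaleParameter

output_size_per_partition = 4096  # placeholder

def weight_loader(param, loaded_weight, *args, **kwargs):
    param.data.copy_(loaded_weight)  # stand-in loader

# One scale per output channel; the scale is not sharded along the input dim.
weight_scale = ChannelQuantScaleParameter(
    data=torch.empty(output_size_per_partition, 1, dtype=torch.float32),
    output_dim=0,
    weight_loader=weight_loader,
)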
GroupQuantScaleParameter ¶
Bases: _ColumnvLLMParameter, RowvLLMParameter
Parameter class for weight scales loaded for weights with grouped quantization. Uses both column and row parallelism.
Source code in vllm/model_executor/parameter.py
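A hedged sketch of allocating group-wise scales: one scale per group of input elements, per output channel. The partition sizes, group size, dtype, and weight_loader are placeholders.

import torch
from vllm.model_executor.parameter import GroupQuantScaleParameter

input_size_per_partition, output_size_per_partition = 4096, 4096  # placeholders
group_size = 128  # placeholder quantization group size

def weight_loader(param, loaded_weight, *args, **kwargs):
    param.data.copy_(loaded_weight)  # stand-in loader

# One scale per group of `group_size` input elements, per output channel.
scales = GroupQuantScaleParameter(
    data=torch.empty(input_size_per_partition // group_size,
                     output_size_per_partition,
                     dtype=torch.float16),
    input_dim=0,
    output_dim=1,
    weight_loader=weight_loader,
)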
ModelWeightParameter ¶
Bases: _ColumnvLLMParameter, RowvLLMParameter
Parameter class for linear layer weights. Uses both column and row parallelism.
Source code in vllm/model_executor/parameter.py
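A sketch of how a dense linear weight is typically declared: the [out, in] tensor is sharded along dim 0 for column parallelism and dim 1 for row parallelism. The sizes and weight_loader are placeholders for what the calling layer provides.

import torch
from vllm.model_executor.parameter import ModelWeightParameter

output_size_per_partition, input_size_per_partition = 4096, 4096  # placeholders

def weight_loader(param, loaded_weight, *args, **kwargs):
    param.data.copy_(loaded_weight)  # stand-in loader

weight = ModelWeightParameter(
    data=torch.empty(output_size_per_partition,
                     input_size_per_partition,
                     dtype=torch.bfloat16),
    input_dim=1,   # sharded for row-parallel layers
    output_dim=0,  # sharded for column-parallel layers
    weight_loader=weight_loader,
)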
PackedColumnParameter ¶
Bases: _ColumnvLLMParameter
Parameter for model parameters which are packed on disk and support column parallelism only. See PackedvLLMParameter for more details on the packed properties.
Source code in vllm/model_executor/parameter.py
__init__ ¶
__init__(
packed_factor: Union[int, Fraction],
packed_dim: int,
marlin_tile_size: Optional[int] = None,
bitblas_tile_size: Optional[int] = None,
**kwargs,
)
Source code in vllm/model_executor/parameter.py
adjust_shard_indexes_for_packing ¶
Source code in vllm/model_executor/parameter.py
PackedvLLMParameter ¶
Bases: ModelWeightParameter
Parameter for model weights which are packed on disk. Example: GPTQ Marlin weights are int4 or int8, packed into int32. Extends the ModelWeightParameter to take in the packed factor, the packed dimension, and optionally, marlin tile size for marlin kernels. Adjusts the shard_size and shard_offset for fused linear layers model weight loading by accounting for packing and optionally, marlin tile size.
Source code in vllm/model_executor/parameter.py
__init__ ¶
__init__(
packed_factor: Union[int, Fraction],
packed_dim: int,
marlin_tile_size: Optional[int] = None,
bitblas_tile_size: Optional[int] = None,
**kwargs,
)
Source code in vllm/model_executor/parameter.py
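A sketch of a GPTQ-style packed weight: int4 values packed 8-per-int32 along the input dimension. packed_factor tells the loader to divide shard sizes and offsets on packed_dim by 8 when slicing fused layers. Sizes and weight_loader are placeholders.

import torch
from vllm.model_executor.parameter import PackedvLLMParameter

input_size_per_partition, output_size_per_partition = 4096, 4096  # placeholders
pack_factor = 8  # int4 values packed 8-per-int32

def weight_loader(param, loaded_weight, *args, **kwargs):
    param.data.copy_(loaded_weight)  # stand-in loader

qweight = PackedvLLMParameter(
    data=torch.empty(input_size_per_partition // pack_factor,
                     output_size_per_partition,
                     dtype=torch.int32),
    input_dim=0,
    output_dim=1,
    packed_dim=0,
    packed_factor=pack_factor,
    weight_loader=weight_loader,
)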
adjust_shard_indexes_for_packing ¶
Source code in vllm/model_executor/parameter.py
PerTensorScaleParameter ¶
Bases: BasevLLMParameter
Parameter class for scales where the number of scales is equivalent to the number of logical matrices in fused linear layers (e.g. for QKV, there are 3 scales loaded from disk). This is relevant to weights with per-tensor quantization. Adds functionality to map the scales to a shard during weight loading.
Note: additional parameter manipulation may be handled per quantization config, within process_weights_after_loading.
Source code in vllm/model_executor/parameter.py
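A sketch of the per-tensor case for a fused QKV projection: three logical matrices, so three scales loaded from disk into one parameter of shape (3,). The weight_loader is a placeholder.

import torch
from vllm.model_executor.parameter import PerTensorScaleParameter

def weight_loader(param, loaded_weight, *args, **kwargs):
    param.data.copy_(loaded_weight)  # stand-in loader

# One scale per logical matrix in the fused layer (q, k, v).
qkv_scale = PerTensorScaleParameter(
    data=torch.empty(3, dtype=torch.float32),
    weight_loader=weight_loader,
)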
__init__ ¶
_load_into_shard_id ¶
Slice the parameter data based on the shard id for loading.
Source code in vllm/model_executor/parameter.py
load_column_parallel_weight ¶
load_merged_column_weight ¶
load_qkv_weight ¶
RowvLLMParameter ¶
Bases: BasevLLMParameter
Parameter class defining weight-loading functionality (load_row_parallel_weight) for parameters being loaded into linear layers with row-parallel functionality. Requires an input_dim to be defined.
Source code in vllm/model_executor/parameter.py
load_row_parallel_weight ¶
load_row_parallel_weight(loaded_weight: Tensor)
Source code in vllm/model_executor/parameter.py
SharedWeightParameter ¶
Bases: BasevLLMParameter
Parameter for weights with many shared tensors across a model. For example, when applying transforms to the "gate" and "up" partitions of MergedColumnParallelLinear, the transform weights must stay separate tensors in order to allow for tensor memory sharing between layers.
Source code in vllm/model_executor/parameter.py
kwargs instance-attribute ¶
kwargs = {
"input_dim": input_dim,
"output_dim": output_dim,
"weight_loader": _fake_weight_loader,
}
partitions instance-attribute ¶
partitions: dict[
int, Union[ModelWeightParameter, Parameter]
] = {}
tensors_registry class-attribute instance-attribute ¶
tensors_registry: WeakValueDictionary = (
WeakValueDictionary()
)
__init__ ¶
Source code in vllm/model_executor/parameter.py
__new__ ¶
_fake_weight_loader ¶
_fake_weight_loader(
param: BasevLLMParameter,
loaded_weight: Tensor,
loaded_weight_shard_id: Optional[Union[str, int]],
)
Source code in vllm/model_executor/parameter.py
add_partition ¶
Add a partition to the weight parameter. Partitions whose data_key is the same will share tensor data.
:param index: index of partition to add
:param data_key: hashable key used to key shared tensors
:param *args: arguments for torch.empty
:param **kwargs: keyword arguments for torch.empty
Source code in vllm/model_executor/parameter.py
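An illustrative sketch only, not SharedWeightParameter's own code: how a WeakValueDictionary keyed by a hashable data_key (as documented above) can back multiple partitions with a single tensor while still letting unused tensors be garbage-collected. The helper name shared_tensor is hypothetical.

import torch
from weakref import WeakValueDictionary

tensors_registry: WeakValueDictionary = WeakValueDictionary()

def shared_tensor(data_key, *shape, **factory_kwargs) -> torch.Tensor:
    # Reuse the tensor registered under data_key if it is still alive,
    # otherwise allocate a new one and register it.
    tensor = tensors_registry.get(data_key)
    if tensor is None:
        tensor = torch.empty(*shape, **factory_kwargs)
        tensors_registry[data_key] = tensor
    return tensor

gate = shared_tensor("gate_transform", 64, 64, dtype=torch.float32)
up = shared_tensor("gate_transform", 64, 64, dtype=torch.float32)
assert gate.data_ptr() == up.data_ptr()  # both partitions share one storage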
load_column_parallel_weight ¶
load_column_parallel_weight(loaded_weight: Tensor)
Source code in vllm/model_executor/parameter.py
load_merged_column_weight ¶
load_merged_column_weight(loaded_weight: Tensor, **kwargs)
Source code in vllm/model_executor/parameter.py
load_qkv_weight ¶
load_qkv_weight(loaded_weight: Tensor, **kwargs)
Source code in vllm/model_executor/parameter.py
_ColumnvLLMParameter ¶
Bases: BasevLLMParameter
Private class defining weight loading functionality (load_merged_column_weight, load_qkv_weight) for parameters being loaded into linear layers with column parallelism. This includes QKV and MLP layers which are not already fused on disk. Requires an output dimension to be defined. Called within the weight loader of each of the column parallel linear layers.
Source code in vllm/model_executor/parameter.py
load_column_parallel_weight ¶
load_column_parallel_weight(loaded_weight: Tensor)
Source code in vllm/model_executor/parameter.py
load_merged_column_weight ¶
load_merged_column_weight(loaded_weight: Tensor, **kwargs)
Source code in vllm/model_executor/parameter.py
load_qkv_weight ¶
load_qkv_weight(loaded_weight: Tensor, **kwargs)
Source code in vllm/model_executor/parameter.py
_adjust_shard_indexes_for_bitblas ¶
_adjust_shard_indexes_for_marlin ¶
_adjust_shard_indexes_for_packing ¶
_adjust_shard_indexes_for_packing(
shard_size,
shard_offset,
packed_factor,
marlin_tile_size,
bitblas_tile_size,
)
Source code in vllm/model_executor/parameter.py
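A rough sketch of the core arithmetic, ignoring the optional marlin/bitblas tile adjustments: shard sizes and offsets along the packed dimension shrink by the packed factor, because several logical elements occupy one stored element. The function name here is illustrative, not the module's own.

def adjust_for_packing(shard_size: int, shard_offset: int,
                       packed_factor: int) -> tuple[int, int]:
    # N logical elements are stored in one packed element, so both the size
    # and the starting offset of a shard along packed_dim are divided by N.
    return shard_size // packed_factor, shard_offset // packed_factor

# e.g. an int4-in-int32 shard of 4096 logical rows starting at logical
# offset 8192 maps to 512 stored rows starting at stored offset 1024.
print(adjust_for_packing(4096, 8192, 8))  # (512, 1024)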
permute_param_layout_ ¶
permute_param_layout_(
param: BasevLLMParameter,
input_dim: int,
output_dim: int,
**kwargs,
) -> BasevLLMParameter
Permute a parameter's layout to the specified input and output dimensions. This is useful for forcing the parameter into a known layout. For example, if a packed (quantized) weight matrix must be in the layout {input_dim = 0, output_dim = 1, packed_dim = 0}, then calling permute_param_layout_(x, input_dim=0, output_dim=1, packed_dim=0) ensures x is in that layout, permuting it if required and asserting if it cannot be brought into the correct layout.
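A short usage sketch based on the call shown in the docstring. The parameter shape, packed factor, and stand-in loader below are placeholders; only the documented keyword arguments are relied upon.

import torch
from vllm.model_executor.parameter import (PackedvLLMParameter,
                                            permute_param_layout_)

qweight = PackedvLLMParameter(
    data=torch.empty(512, 4096, dtype=torch.int32),  # placeholder shape
    input_dim=0, output_dim=1, packed_dim=0, packed_factor=8,
    weight_loader=lambda *args, **kwargs: None,  # stand-in loader
)

# Force the packed weight into the layout a downstream kernel expects;
# the data is permuted if needed, and an assertion fires if it cannot be.
qweight = permute_param_layout_(qweight, input_dim=0, output_dim=1,
                                packed_dim=0)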