vllm.model_executor.models
Modules:
Name | Description |
---|---|
adapters | |
aimv2 | |
apertus | Inference-only Apertus model compatible with HuggingFace weights. |
arcee | |
arctic | Inference-only Snowflake Arctic model. |
aria | |
aya_vision | |
baichuan | Inference-only BaiChuan model compatible with HuggingFace weights. |
bailing_moe | Inference-only BailingMoE model compatible with HuggingFace weights. |
bamba | Inference-only Bamba model. |
bart | PyTorch BART model. |
bert | |
bert_with_rope | |
blip | Minimal implementation of BlipVisionModel intended to be used only within a vision language model. |
blip2 | |
bloom | Inference-only BLOOM model compatible with HuggingFace weights. |
chameleon | |
chatglm | Inference-only ChatGLM model compatible with THUDM weights. |
clip | Minimal implementation of CLIPVisionModel intended to be used only within a vision language model. |
cohere2_vision | Command-A-Vision (Cohere2Vision) multimodal model implementation for vLLM. |
commandr | PyTorch Cohere model. |
config | |
constant_size_cache | |
dbrx | |
deepseek | Inference-only Deepseek model. |
deepseek_eagle | |
deepseek_mtp | |
deepseek_v2 | Inference-only DeepseekV2/DeepseekV3 model. |
deepseek_vl2 | Inference-only Deepseek-VL2 model compatible with HuggingFace weights. |
donut | |
dots1 | Inference-only dots1 model. |
ernie45 | Inference-only Ernie model compatible with HuggingFace weights. |
ernie45_moe | Inference-only ErnieMoE model compatible with HuggingFace weights. |
ernie45_vl | Inference-only Ernie VL model compatible with HuggingFace weights. |
ernie45_vl_moe | Inference-only Ernie VL MoE model compatible with HuggingFace weights. |
ernie_mtp | Inference-only Ernie-MTP model. |
exaone | Inference-only Exaone model compatible with HuggingFace weights. |
exaone4 | Inference-only Exaone model compatible with HuggingFace weights. |
fairseq2_llama | Llama model for fairseq2 weights. |
falcon | PyTorch Falcon model. |
falcon_h1 | Inference-only FalconH1 model. |
florence2 | |
fuyu | PyTorch Fuyu model. |
gemma | Inference-only Gemma model compatible with HuggingFace weights. |
gemma2 | |
gemma3 | |
gemma3_mm | |
gemma3n | |
gemma3n_mm | |
glm | Inference-only HF format GLM-4 model compatible with THUDM weights. |
glm4 | Inference-only GLM-4-0414 model compatible with HuggingFace weights. |
glm4_1v | Inference-only GLM-4V model compatible with HuggingFace weights. |
glm4_moe | Inference-only GLM-4.5 model compatible with HuggingFace weights. |
glm4_moe_mtp | Inference-only GLM-4.5 MTP model compatible with HuggingFace weights. |
glm4v | Inference-only CogAgent model compatible with THUDM weights. |
gpt2 | Inference-only GPT-2 model compatible with HuggingFace weights. |
gpt_bigcode | Inference-only GPTBigCode model compatible with HuggingFace weights. |
gpt_j | Inference-only GPT-J model compatible with HuggingFace weights. |
gpt_neox | Inference-only GPT-NeoX model compatible with HuggingFace weights. |
gpt_oss | |
granite | Inference-only IBM Granite model compatible with HuggingFace weights. |
granite_speech | Inference-only IBM Granite speech model. |
granitemoe | Inference-only GraniteMoe model. |
granitemoehybrid | Inference-only GraniteMoeHybrid model. |
granitemoeshared | Inference-only GraniteMoeShared model. |
gritlm | |
grok1 | Inference-only Grok1 model. |
h2ovl | |
hunyuan_v1 | Inference-only HunYuan model compatible with HuggingFace weights. |
hyperclovax_vision | |
idefics2_vision_model | PyTorch Idefics2 model. |
idefics3 | Inference-only Idefics3 model compatible with HuggingFace weights. |
interfaces | |
interfaces_base | |
intern_vit | |
internlm2 | |
internlm2_ve | |
interns1 | |
interns1_vit | |
internvl | |
jais | Inference-only Jais model compatible with HuggingFace weights. |
jamba | Inference-only Jamba model. |
jina_vl | |
keye | |
keye_vl1_5 | |
kimi_vl | |
lfm2 | |
llama | Inference-only LLaMA model compatible with HuggingFace weights. |
llama4 | Inference-only LLaMA model compatible with HuggingFace weights. |
llama4_eagle | |
llama_eagle | |
llama_eagle3 | |
llava | |
llava_next | |
llava_next_video | |
llava_onevision | |
mamba | PyTorch MAMBA model. |
mamba2 | PyTorch MAMBA2 model. |
mamba_cache | |
medusa | |
mimo | Inference-only MiMo model compatible with HuggingFace weights. |
mimo_mtp | Inference-only MiMo-MTP model. |
minicpm | Inference-only MiniCPM model compatible with HuggingFace weights. |
minicpm3 | Inference-only MiniCPM3 model compatible with HuggingFace weights. |
minicpm_eagle | Inference-only EagleMiniCPM model compatible with HuggingFace weights. |
minicpmo | Inference-only MiniCPM-O model compatible with HuggingFace weights. |
minicpmv | Inference-only MiniCPM-V model compatible with HuggingFace weights. |
minimax_cache | |
minimax_text_01 | Inference-only MiniMaxText01 model. |
minimax_vl_01 | |
mistral3 | |
mixtral | Inference-only Mixtral model. |
mixtral_quant | Inference-only Mixtral model. |
mllama | PyTorch Mllama model. |
mllama4 | |
mlp_speculator | |
modernbert | |
module_mapping | |
molmo | |
moonvit | |
mpt | |
nemotron | Inference-only Nemotron model compatible with HuggingFace weights. |
nemotron_h | Inference-only NemotronH model. |
nemotron_nas | Inference-only Deci model compatible with HuggingFace weights. |
nemotron_vl | |
nvlm_d | |
olmo | Inference-only OLMo model compatible with HuggingFace weights. |
olmo2 | Inference-only OLMo2 model compatible with HuggingFace weights. |
olmoe | Inference-only OLMoE model compatible with HuggingFace weights. |
opt | Inference-only OPT model compatible with HuggingFace weights. |
orion | Inference-only Orion-14B model compatible with HuggingFace weights. |
ovis | PyTorch Ovis model. |
ovis2_5 | PyTorch Ovis model. |
paligemma | |
persimmon | Inference-only Persimmon model compatible with HuggingFace weights. |
phi | Inference-only Phi-1.5 model compatible with HuggingFace weights. |
phi3 | Inference-only Phi3 model; code inherits from llama.py. |
phi3v | |
phi4_multimodal | |
phi4flash | |
phi4mm | |
phi4mm_audio | |
phi4mm_utils | |
phimoe | Inference-only PhiMoE model. |
pixtral | |
plamo2 | Inference-only PLaMo2 model. |
prithvi_geospatial_mae | Inference-only IBM/NASA Prithvi Geospatial model. |
qwen | Inference-only QWen model compatible with HuggingFace weights. |
qwen2 | Inference-only Qwen2 model compatible with HuggingFace weights. |
qwen2_5_omni_thinker | Inference-only Qwen2.5-Omni model (thinker part). |
qwen2_5_vl | Inference-only Qwen2.5-VL model compatible with HuggingFace weights. |
qwen2_audio | Inference-only Qwen2-Audio model compatible with HuggingFace weights. |
qwen2_moe | Inference-only Qwen2MoE model compatible with HuggingFace weights. |
qwen2_rm | Inference-only Qwen2-RM model compatible with HuggingFace weights. |
qwen2_vl | Inference-only Qwen2-VL model compatible with HuggingFace weights. |
qwen3 | Inference-only Qwen3 model compatible with HuggingFace weights. |
qwen3_moe | Inference-only Qwen3MoE model compatible with HuggingFace weights. |
qwen_vl | Inference-only Qwen-VL model compatible with HuggingFace weights. |
registry | Whenever you add an architecture to this page, please also update tests/models/registry.py with example HuggingFace models for it. |
roberta | |
rvl | |
seed_oss | Inference-only SeedOss model compatible with HuggingFace weights. |
siglip | Implementation of SiglipVisionModel intended to be used only within a vision language model. |
siglip2navit | Implementation of SiglipVisionModel intended to be used only within a vision language model. |
skyworkr1v | |
smolvlm | |
solar | Inference-only Solar model compatible with HuggingFace weights. |
stablelm | Inference-only StableLM (https://github.com/Stability-AI/StableLM) model compatible with HuggingFace weights. |
starcoder2 | PyTorch Starcoder2 model. |
step3_text | Inference-only Step3 text model. |
step3_vl | |
swin | |
tarsier | |
telechat2 | |
teleflm | |
transformers | Wrapper around transformers models. |
ultravox | PyTorch Ultravox model. |
utils | |
vision | |
voxtral | |
whisper | |
zamba2 | PyTorch Zamba2 model implementation for vLLM. |
ModelRegistry module-attribute ¶
ModelRegistry = _ModelRegistry(
    {
        # _VLLM_MODELS maps each architecture name to its (module, class name)
        # pair; registration is lazy so importing this package stays cheap.
        model_arch: _LazyRegisteredModel(
            module_name=f"vllm.model_executor.models.{mod_relname}",
            class_name=cls_name,
        )
        for model_arch, (mod_relname, cls_name) in _VLLM_MODELS.items()
    }
)
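The same registry is the entry point for plugging in out-of-tree architectures. A minimal sketch, assuming a hypothetical package my_package with a class MyLlamaForCausalLM; register_model also accepts the class object directly, while the "module:ClassName" string form keeps the registration lazy, like the built-in entries above:

from vllm import ModelRegistry

# Hypothetical out-of-tree architecture; the string form defers the import
# until the model class is actually needed.
ModelRegistry.register_model(
    "MyLlamaForCausalLM",
    "my_package.my_llama:MyLlamaForCausalLM",
)

# The architecture is now resolvable alongside the built-in ones.
print("MyLlamaForCausalLM" in ModelRegistry.get_supported_archs())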
__all__ module-attribute ¶
__all__ = [
"ModelRegistry",
"VllmModelForPooling",
"is_pooling_model",
"VllmModelForTextGeneration",
"is_text_generation_model",
"HasInnerState",
"has_inner_state",
"SupportsLoRA",
"supports_lora",
"SupportsMultiModal",
"supports_multimodal",
"SupportsPP",
"supports_pp",
"SupportsTranscription",
"supports_transcription",
"SupportsV0Only",
"supports_v0_only",
]
HasInnerState ¶
Bases: Protocol
The interface required for all models that have inner state.
Source code in vllm/model_executor/models/interfaces.py
SupportsLoRA ¶
Bases: Protocol
The interface required for all models that support LoRA.
Source code in vllm/model_executor/models/interfaces.py
SupportsMultiModal ¶
Bases: Protocol
The interface required for all multi-modal models.
Source code in vllm/model_executor/models/interfaces.py
supports_encoder_tp_data class-attribute ¶
supports_encoder_tp_data: bool = False
A flag that indicates whether this model supports multimodal_config.mm_encoder_tp_mode="data".
supports_multimodal class-attribute ¶
supports_multimodal: Literal[True] = True
A flag that indicates this model supports multi-modal inputs.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
supports_multimodal_raw_input_only class-attribute ¶
supports_multimodal_raw_input_only: bool = False
A flag that indicates this model supports multi-modal inputs and processes them in their raw form and not embeddings.
get_input_embeddings ¶
get_input_embeddings(
input_ids: Tensor,
multimodal_embeddings: Optional[
MultiModalEmbeddings
] = None,
attn_metadata: Optional[AttentionMetadata] = None,
) -> Tensor
get_input_embeddings(
input_ids: Tensor,
multimodal_embeddings: Optional[
MultiModalEmbeddings
] = None,
) -> Tensor
get_input_embeddings(
input_ids: Tensor,
multimodal_embeddings: Optional[
MultiModalEmbeddings
] = None,
attn_metadata: Optional[AttentionMetadata] = None,
) -> Tensor
Returns the input embeddings merged from the text embeddings from input_ids and the multimodal embeddings generated from multimodal kwargs.
Source code in vllm/model_executor/models/interfaces.py
get_language_model ¶
get_language_model() -> Module
Returns the underlying language model used for text generation.
This is typically the torch.nn.Module instance responsible for processing the merged multimodal embeddings and producing hidden states.
Returns:
Type | Description |
---|---|
Module | torch.nn.Module: The core language model component. |
Source code in vllm/model_executor/models/interfaces.py
get_multimodal_embeddings ¶
get_multimodal_embeddings(
**kwargs: object,
) -> MultiModalEmbeddings
Returns multimodal embeddings generated from multimodal kwargs to be merged with text embeddings.
Note
The returned multimodal embeddings must be in the same order as the appearances of their corresponding multimodal data item in the input prompt.
Source code in vllm/model_executor/models/interfaces.py
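Taken together with get_input_embeddings above, a hypothetical implementation might look roughly like the sketch below. MyVLModel, its toy vision tower, and the placeholder token id are illustrative only; merge_multimodal_embeddings is the helper from vllm.model_executor.models.utils.

import torch
from torch import nn

from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.model_executor.models.utils import merge_multimodal_embeddings


class MyVLModel(nn.Module, SupportsMultiModal):
    """Hypothetical multimodal model, shown only to illustrate the protocol."""

    image_token_id = 32000  # hypothetical placeholder token id

    def __init__(self, vocab_size: int = 32064, hidden_size: int = 1024):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Stand-in for a real vision encoder.
        self.vision_tower = nn.Linear(3 * 336 * 336, hidden_size)

    def get_multimodal_embeddings(self, **kwargs: object):
        pixel_values = kwargs.get("pixel_values")
        if pixel_values is None:
            return []
        # One embedding tensor per image, in the order the images
        # appear in the prompt.
        return [self.vision_tower(img.flatten()).unsqueeze(0)
                for img in pixel_values]

    def get_input_embeddings(self, input_ids, multimodal_embeddings=None):
        inputs_embeds = self.embed_tokens(input_ids)
        if multimodal_embeddings:
            # Scatter the vision embeddings into the placeholder positions.
            inputs_embeds = merge_multimodal_embeddings(
                input_ids, inputs_embeds, multimodal_embeddings,
                self.image_token_id)
        return inputs_embeds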
get_placeholder_str classmethod ¶
Get the placeholder text for the i-th modality item in the prompt.
SupportsPP ¶
Bases: Protocol
The interface required for all models that support pipeline parallel.
Source code in vllm/model_executor/models/interfaces.py
supports_pp class-attribute ¶
supports_pp: Literal[True] = True
A flag that indicates this model supports pipeline parallel.
Note
There is no need to redefine this flag if this class is in the MRO of your model class.
forward ¶
forward(
*, intermediate_tensors: Optional[IntermediateTensors]
) -> Union[Tensor, IntermediateTensors]
Accept IntermediateTensors when PP rank > 0. Return IntermediateTensors only for the last PP rank.
Source code in vllm/model_executor/models/interfaces.py
make_empty_intermediate_tensors ¶
make_empty_intermediate_tensors(
batch_size: int, dtype: dtype, device: device
) -> IntermediateTensors
Called when PP rank > 0 for profiling purposes.
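A rough sketch of the control flow these two methods imply, using a toy stack of layers rather than any real model; MyPPModel and its layers are hypothetical, while get_pp_group and IntermediateTensors are the actual vLLM utilities:

import torch
from torch import nn

from vllm.distributed import get_pp_group
from vllm.model_executor.models.interfaces import SupportsPP
from vllm.sequence import IntermediateTensors


class MyPPModel(nn.Module, SupportsPP):
    """Hypothetical decoder, shown only to illustrate the PP contract."""

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 1024):
        super().__init__()
        self.hidden_size = hidden_size
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(4))
        self.norm = nn.LayerNorm(hidden_size)

    def make_empty_intermediate_tensors(self, batch_size, dtype, device):
        # Only used for profiling on ranks other than the first.
        return IntermediateTensors({
            "hidden_states": torch.zeros(
                batch_size, self.hidden_size, dtype=dtype, device=device),
        })

    def forward(self, input_ids, positions, intermediate_tensors=None):
        if get_pp_group().is_first_rank:
            hidden_states = self.embed_tokens(input_ids)
        else:
            # Resume from the tensors handed over by the previous stage.
            assert intermediate_tensors is not None
            hidden_states = intermediate_tensors["hidden_states"]

        for layer in self.layers:
            hidden_states = layer(hidden_states)

        if not get_pp_group().is_last_rank:
            # Intermediate ranks wrap the partial result for the next stage.
            return IntermediateTensors({"hidden_states": hidden_states})
        return self.norm(hidden_states)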
SupportsTranscription ¶
Bases: Protocol
The interface required for all models that support transcription.
Source code in vllm/model_executor/models/interfaces.py
supports_transcription_only class-attribute ¶
supports_transcription_only: bool = False
Transcription models can opt out of text generation by setting this to True.
__init_subclass__ ¶
Source code in vllm/model_executor/models/interfaces.py
get_generation_prompt classmethod ¶
get_generation_prompt(
audio: ndarray,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
language: Optional[str],
task_type: Literal["transcribe", "translate"],
request_prompt: str,
to_language: Optional[str],
) -> PromptType
Get the prompt for the ASR model. The model has control over the construction, as long as it returns a valid PromptType.
Source code in vllm/model_executor/models/interfaces.py
get_num_audio_tokens classmethod ¶
get_num_audio_tokens(
audio_duration_s: float,
stt_config: SpeechToTextConfig,
model_config: ModelConfig,
) -> Optional[int]
Map from audio duration to number of audio tokens produced by the ASR model, without running a forward pass. This is used for estimating the amount of processing for this audio.
Source code in vllm/model_executor/models/interfaces.py
get_other_languages classmethod ¶
get_speech_to_text_config classmethod ¶
get_speech_to_text_config(
model_config: ModelConfig,
task_type: Literal["transcribe", "translate"],
) -> SpeechToTextConfig
Get the speech to text config for the ASR model.
validate_language classmethod ¶
Ensure the language specified in the transcription request is a valid ISO 639-1 language code. If the request language is valid, but not natively supported by the model, trigger a warning (but not an exception).
Source code in vllm/model_executor/models/interfaces.py
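Putting the classmethods together, a hypothetical ASR model could advertise transcription support roughly as below. The prompt format, the 30-second clip limit, and the fixed tokens-per-second estimate are illustrative assumptions, not the behavior of any specific vLLM model:

from vllm.config import SpeechToTextConfig
from vllm.model_executor.models.interfaces import SupportsTranscription


class MyASRModel(SupportsTranscription):
    """Hypothetical ASR model, shown only to illustrate the protocol."""

    supported_languages = {"en": "English", "fr": "French"}

    @classmethod
    def get_speech_to_text_config(cls, model_config, task_type):
        # Illustrative: 30 s clips resampled to 16 kHz.
        return SpeechToTextConfig(sample_rate=16000, max_audio_clip_s=30)

    @classmethod
    def get_generation_prompt(cls, audio, stt_config, model_config,
                              language, task_type, request_prompt,
                              to_language):
        # Hand the raw audio to the multimodal pipeline and steer the
        # decoder with task tokens; the exact format is model specific.
        task = "<|translate|>" if task_type == "translate" else "<|transcribe|>"
        return {
            "prompt": f"<|startoftranscript|><|{language or 'en'}|>{task}",
            "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
        }

    @classmethod
    def get_num_audio_tokens(cls, audio_duration_s, stt_config, model_config):
        # Illustrative estimate: assume a fixed token rate per second.
        return int(audio_duration_s * 50)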
SupportsV0Only ¶
Bases: Protocol
Models with this interface are not compatible with V1 vLLM.
Source code in vllm/model_executor/models/interfaces.py
VllmModelForPooling ¶
Bases: VllmModel[T_co], Protocol[T_co]
The interface required for all pooling models in vLLM.
Source code in vllm/model_executor/models/interfaces_base.py
default_pooling_type class-attribute ¶
default_pooling_type: str = 'LAST'
Indicates the vllm.model_executor.layers.pooler.PoolerConfig.pooling_type to use by default.
You can use the vllm.model_executor.models.interfaces_base.default_pooling_type decorator to conveniently set this field.
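A small sketch of applying that decorator; the model class is hypothetical, while the decorator and the "CLS" pooling type come from vLLM:

from torch import nn

from vllm.model_executor.models.interfaces_base import (
    VllmModelForPooling, default_pooling_type)


@default_pooling_type("CLS")
class MyEmbeddingModel(nn.Module, VllmModelForPooling):
    """Hypothetical encoder that should pool the CLS token by default."""


print(MyEmbeddingModel.default_pooling_type)  # -> "CLS"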
VllmModelForTextGeneration ¶
has_inner_state ¶
has_inner_state(model: object) -> TypeIs[HasInnerState]
has_inner_state(
model: type[object],
) -> TypeIs[type[HasInnerState]]
has_inner_state(
model: Union[type[object], object],
) -> Union[
TypeIs[type[HasInnerState]], TypeIs[HasInnerState]
]
is_pooling_model ¶
is_pooling_model(
model: type[object],
) -> TypeIs[type[VllmModelForPooling]]
is_pooling_model(
model: object,
) -> TypeIs[VllmModelForPooling]
is_pooling_model(
model: Union[type[object], object],
) -> Union[
TypeIs[type[VllmModelForPooling]],
TypeIs[VllmModelForPooling],
]
Source code in vllm/model_executor/models/interfaces_base.py
is_text_generation_model ¶
is_text_generation_model(
model: type[object],
) -> TypeIs[type[VllmModelForTextGeneration]]
is_text_generation_model(
model: object,
) -> TypeIs[VllmModelForTextGeneration]
is_text_generation_model(
model: Union[type[object], object],
) -> Union[
TypeIs[type[VllmModelForTextGeneration]],
TypeIs[VllmModelForTextGeneration],
]
Source code in vllm/model_executor/models/interfaces_base.py
supports_lora ¶
supports_lora(
model: type[object],
) -> TypeIs[type[SupportsLoRA]]
supports_lora(model: object) -> TypeIs[SupportsLoRA]
supports_lora(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsLoRA]], TypeIs[SupportsLoRA]
]
Source code in vllm/model_executor/models/interfaces.py
supports_multimodal ¶
supports_multimodal(
model: type[object],
) -> TypeIs[type[SupportsMultiModal]]
supports_multimodal(
model: object,
) -> TypeIs[SupportsMultiModal]
supports_multimodal(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsMultiModal]],
TypeIs[SupportsMultiModal],
]
supports_pp ¶
supports_pp(
model: type[object],
) -> TypeIs[type[SupportsPP]]
supports_pp(model: object) -> TypeIs[SupportsPP]
supports_pp(
model: Union[type[object], object],
) -> Union[
bool, TypeIs[type[SupportsPP]], TypeIs[SupportsPP]
]
Source code in vllm/model_executor/models/interfaces.py
supports_transcription ¶
supports_transcription(
model: type[object],
) -> TypeIs[type[SupportsTranscription]]
supports_transcription(
model: object,
) -> TypeIs[SupportsTranscription]
supports_transcription(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsTranscription]],
TypeIs[SupportsTranscription],
]
supports_v0_only ¶
supports_v0_only(
model: type[object],
) -> TypeIs[type[SupportsV0Only]]
supports_v0_only(model: object) -> TypeIs[SupportsV0Only]
supports_v0_only(
model: Union[type[object], object],
) -> Union[
TypeIs[type[SupportsV0Only]], TypeIs[SupportsV0Only]
]
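Because these helpers are TypeIs-based guards, they narrow both classes and instances. A small usage sketch, using the Llama implementation shipped with vLLM purely as an example class:

from vllm.model_executor.models import (
    is_text_generation_model, supports_multimodal, supports_pp)
from vllm.model_executor.models.llama import LlamaForCausalLM

if is_text_generation_model(LlamaForCausalLM):
    print("generates text")
if supports_pp(LlamaForCausalLM):
    print("can be split across pipeline-parallel stages")
if not supports_multimodal(LlamaForCausalLM):
    print("text-only model")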