vllm.multimodal.cache
MultiModalCacheValue module-attribute

MultiModalCacheValue = Union[
    MultiModalProcessorCacheItem,
    MultiModalProcessorCacheItemMetadata,
    MultiModalKwargsItems,
    MultiModalKwargsItem,
    MultiModalKwargs,
    Mapping[str, NestedTensors],
]
MultiModalProcessorCacheInItem module-attribute

MultiModalProcessorCacheInItem: TypeAlias = Optional[
    tuple[
        MultiModalKwargsItem,
        Sequence["ResolvedPromptUpdate"],
    ]
]
MultiModalProcessorCacheOutItem module-attribute

MultiModalProcessorCacheOutItem: TypeAlias = tuple[
    Optional[MultiModalKwargsItem],
    Sequence["ResolvedPromptUpdate"],
]
BaseMultiModalCache
Abstract base class to read/write multi-modal items from cache.
The idea of multi-modal caching is based on having a client and a server, where the client executes in the frontend process (=P0) and the server in the core process (=P1). The data flow is as follows:

```
     is_cached() x N   get_and_update()
P0: From API --------------------------------> To P1

     get_and_update()
P1: From P0 -----------------> To model
```

is_cached() can be called any number of times in P0. However, get_and_update() must be called in P0 and P1 one after another so that their cache eviction order remains the same.

This ensures that the keys in the P0 and P1 caches are mirrored, allowing us to determine whether a key is cached in P1 by looking up the P0 cache, without having to communicate with P1.
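To make the mirroring concrete, here is a toy sketch (not the vLLM implementation) of two LRU caches that stay in sync purely because both processes apply the same get-or-insert operation in the same order:

```python
from collections import OrderedDict


class ToyLRU:
    """A tiny LRU cache; both processes must share the eviction policy."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: OrderedDict[str, object] = OrderedDict()

    def get_or_insert(self, key: str, value: object) -> object:
        # Hit: refresh the key's position in the eviction order.
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        # Miss: evict the least-recently-used entry if at capacity.
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[key] = value
        return value


p0 = ToyLRU(capacity=2)  # P0: holds only lightweight metadata
p1 = ToyLRU(capacity=2)  # P1: holds the actual tensor data

# Every request touches both caches in the same order, so their key
# sets and eviction order stay identical: P0 can answer "is this hash
# cached on P1?" locally, without an IPC round-trip.
for mm_hash in ["img-a", "img-b", "img-a", "img-c"]:
    p0.get_or_insert(mm_hash, "size metadata")
    p1.get_or_insert(mm_hash, "tensor data")

assert list(p0.entries) == list(p1.entries)  # mirrored keys and order
```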
Source code in vllm/multimodal/cache.py
clear_cache abstractmethod

get_and_update
Possibly update a sequence of multi-modal items based on whether they are in the underlying cache.
This update is done out-of-place and updates the cache eviction order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mm_items | Sequence[_I] | The multi-modal items to update. | required |
mm_hashes | list[str] | The hash of each item in mm_items. | required |
Returns:
Type | Description |
---|---|
list[_O] | A new list of updated multi-modal items. |
Source code in vllm/multimodal/cache.py
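For intuition, the batch method can be thought of as mapping the per-item hook over the inputs in order. The sketch below is an assumption about the default behavior, not copied from the source; the key point is that the hook runs once per item, in sequence, which is what keeps the eviction order deterministic:

```python
from typing import Sequence


def get_and_update(cache, mm_items: Sequence, mm_hashes: list[str]) -> list:
    assert len(mm_items) == len(mm_hashes), "one hash per item"
    # Each call may also refresh the cache eviction order, so the
    # iteration order must match between P0 and P1.
    return [
        cache.get_and_update_item(item, mm_hash)
        for item, mm_hash in zip(mm_items, mm_hashes)
    ]
```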
get_and_update_item abstractmethod
Possibly update a multi-modal item based on whether it is in the underlying cache.
This update is done out-of-place and updates the cache eviction order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mm_item | _I | The multi-modal item to update. | required |
mm_hash | str | The hash of mm_item. | required |
Returns:
Type | Description |
---|---|
_O | The updated multi-modal item. |
Source code in vllm/multimodal/cache.py
BaseMultiModalProcessorCache
Bases: BaseMultiModalCache[MultiModalProcessorCacheInItem, MultiModalProcessorCacheOutItem]
The required interface for caches on P0.
Source code in vllm/multimodal/cache.py
is_cached

Check whether each item in a sequence of multi-modal items is in the underlying cache.
This DOES NOT update the cache eviction order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mm_hashes | list[str] | The hash of each item to check. | required |
Returns:
Type | Description |
---|---|
list[bool] | For each item, whether it is in the underlying cache. |
Source code in vllm/multimodal/cache.py
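A hypothetical usage snippet on P0 (the variable p0_cache is assumed to be some BaseMultiModalProcessorCache instance); because the lookup does not touch the eviction order, it is safe to call repeatedly while planning a request:

```python
# Hypothetical: decide which items still need processing/transfer.
mm_hashes = ["img-a", "img-b", "img-c"]
hits = p0_cache.is_cached(mm_hashes)  # e.g. [True, False, True]
missing = [h for h, hit in zip(mm_hashes, hits) if not hit]
```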
is_cached_item abstractmethod
Check whether a multi-modal item is in the underlying cache.
This DOES NOT update the cache eviction order.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mm_hash | str | The hash of the item to check. | required |
Returns:
Type | Description |
---|---|
bool | Whether the item is in the underlying cache. |
Source code in vllm/multimodal/cache.py
BaseMultiModalReceiverCache
Bases: BaseMultiModalCache[Optional[MultiModalKwargsItem], MultiModalKwargsItem]
The required interface for caches on P1.
Source code in vllm/multimodal/cache.py
get_and_update_features

get_and_update_features(
    mm_features: list[MultiModalFeatureSpec],
) -> list[MultiModalFeatureSpec]
Update multimodal features with cached encoder outputs.
Source code in vllm/multimodal/cache.py
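A hedged sketch of what this amounts to, assuming (hypothetically) that each MultiModalFeatureSpec carries a data payload and an identifier hash; the actual field names may differ:

```python
def get_and_update_features(self, mm_features):
    # Hypothetical field names: `data` may be None when P0 skipped the
    # transfer; `identifier` is the item's hash.
    for feature in mm_features:
        feature.data = self.get_and_update_item(
            feature.data, feature.identifier
        )
    return mm_features
```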
MultiModalCache
Source code in vllm/multimodal/cache.py
get_item_complexity classmethod
get_item_complexity(value: MultiModalCacheValue) -> int
Get the number of leaf elements in a multi-modal cache value.
This provides a measure of structural complexity that can be useful for debugging cache performance and understanding data patterns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value | MultiModalCacheValue | The multi-modal cache value to analyze. | required |
Returns:
Type | Description |
---|---|
int | The number of leaf elements in the nested structure. |
Source code in vllm/multimodal/cache.py
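As a stand-in for the idea (not the vLLM implementation), counting leaves of a nested tensor mapping might look like this, assuming each tensor or other terminal value counts as a single leaf:

```python
import torch


def count_leaves(value) -> int:
    # Recurse through dicts and lists; anything else is one leaf.
    if isinstance(value, dict):
        return sum(count_leaves(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return sum(count_leaves(v) for v in value)
    return 1


nested = {
    "pixel_values": [torch.zeros(3, 224, 224), torch.zeros(3, 224, 224)],
    "image_grid_thw": torch.tensor([1, 16, 16]),
}
assert count_leaves(nested) == 3  # two pixel tensors + one grid tensor
```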
get_item_size classmethod

get_item_size(
    value: MultiModalCacheValue, *, debug: bool = False
) -> int
Source code in vllm/multimodal/cache.py
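This presumably feeds the LRU's memory accounting. A rough stand-in that sums tensor storage sizes (an assumption about what "size" means here; the real implementation may count additional overhead):

```python
import torch


def nested_size_bytes(value) -> int:
    # Approximate footprint: total bytes of all tensors in the structure.
    if isinstance(value, torch.Tensor):
        return value.numel() * value.element_size()
    if isinstance(value, dict):
        return sum(nested_size_bytes(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return sum(nested_size_bytes(v) for v in value)
    return 0  # non-tensor metadata is ignored in this sketch


item = {"pixel_values": torch.zeros(3, 224, 224)}  # float32
assert nested_size_bytes(item) == 3 * 224 * 224 * 4
```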
get_leaf_size classmethod
Source code in vllm/multimodal/cache.py
MultiModalProcessorCacheItem

The data to store inside MultiModalProcessorOnlyCache.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item | MultiModalKwargsItem | The processed tensor data corresponding to a multi-modal item. | required |
prompt_updates | Sequence[ResolvedPromptUpdate] | The prompt updates corresponding to item. | required |
Source code in vllm/multimodal/cache.py
__init__

__init__(
    item: MultiModalKwargsItem,
    prompt_updates: Sequence[ResolvedPromptUpdate],
) -> None
MultiModalProcessorCacheItemMetadata

The metadata to store inside MultiModalProcessorSenderCache.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item | MultiModalKwargsItem | The processed tensor data corresponding to a multi-modal item. Since P1 already stores the tensor data, we only store its size metadata in P0 to reduce memory usage. The size metadata is still needed to keep the same cache eviction policy as P1. | required |
prompt_updates | Sequence[ResolvedPromptUpdate] | The prompt updates corresponding to item. | required |
Source code in vllm/multimodal/cache.py
__init__

__init__(
    item: MultiModalKwargsItem,
    prompt_updates: Sequence[ResolvedPromptUpdate],
) -> None
MultiModalProcessorOnlyCache
Bases: BaseMultiModalProcessorCache
The cache which is used on P0 when IPC caching is disabled.
How to update each item:
- If the item is in the cache, replace the input with the cached item.
- If the item is not in the cache, store that item (which includes tensor data and metadata) into the cache, and return the input.
Source code in vllm/multimodal/cache.py
_cache instance-attribute

_cache = get_lru_cache(
    mm_processor_cache_gb, MultiModalProcessorCacheItem
)
clear_cache

get_and_update_item

get_and_update_item(
    mm_item: MultiModalProcessorCacheInItem, mm_hash: str
) -> MultiModalProcessorCacheOutItem
Source code in vllm/multimodal/cache.py
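A simplified sketch of the hit/miss rule described above (not the actual source; it assumes self._cache maps an mm_hash to a stored MultiModalProcessorCacheItem):

```python
def get_and_update_item(self, mm_item, mm_hash):
    # Hit: replace the input with the cached item; the lookup also
    # refreshes the entry's position in the LRU order.
    cached = self._cache.get(mm_hash)
    if cached is not None:
        return (cached.item, cached.prompt_updates)
    # Miss: the caller must have supplied the data; store it in full
    # (tensor data + prompt updates) and pass the input through.
    assert mm_item is not None, "uncached item must carry data"
    self._cache[mm_hash] = MultiModalProcessorCacheItem(*mm_item)
    return mm_item
```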
MultiModalProcessorSenderCache
Bases: BaseMultiModalProcessorCache
The cache which is used on P0 when IPC caching is enabled.
How to update each item:
- If the item is already in the cache, clear the input to avoid unnecessary IPC.
- If the item is not in the cache, store the metadata of that item so that the eviction policy remains the same as the cache on P1, and return the input. By only storing the metadata, we avoid keeping the data itself in memory inside P0.
Source code in vllm/multimodal/cache.py
_cache instance-attribute

_cache = get_lru_cache(
    mm_processor_cache_gb,
    MultiModalProcessorCacheItemMetadata,
)
clear_cache

get_and_update_item

get_and_update_item(
    mm_item: MultiModalProcessorCacheInItem, mm_hash: str
) -> MultiModalProcessorCacheOutItem
Source code in vllm/multimodal/cache.py
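By contrast, the sender cache can be sketched like this (again simplified, not the actual source): on a hit the tensor payload is dropped so nothing is re-sent over IPC, and on a miss only size metadata is recorded so the eviction order mirrors P1:

```python
def get_and_update_item(self, mm_item, mm_hash):
    # Hit: P1 already holds the tensors, so clear the payload and
    # forward only the cached prompt updates.
    cached = self._cache.get(mm_hash)
    if cached is not None:
        return (None, cached.prompt_updates)
    # Miss: record size metadata only, which keeps the same eviction
    # policy as P1 without holding tensors in P0, then send the item.
    assert mm_item is not None, "uncached item must carry data"
    self._cache[mm_hash] = MultiModalProcessorCacheItemMetadata(*mm_item)
    return mm_item
```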
MultiModalReceiverCache
Bases: BaseMultiModalReceiverCache
The cache which is used on P1 when IPC caching is enabled.
How to update each item:
- If the item is in the cache, replace the input with the cached item.
- If the item is not in the cache, store that item (which includes tensor data) into the cache, and return the input.
Source code in vllm/multimodal/cache.py
clear_cache

get_and_update_item

get_and_update_item(
    mm_item: Optional[MultiModalKwargsItem], mm_hash: str
) -> MultiModalKwargsItem
Source code in vllm/multimodal/cache.py
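And the receiving side on P1, as a simplified sketch: a None input signals that P0 determined the item is already cached here.

```python
def get_and_update_item(self, mm_item, mm_hash):
    # None means P0 skipped the transfer; the mirrored eviction order
    # guarantees this lookup succeeds.
    if mm_item is None:
        return self._cache[mm_hash]
    # Otherwise cache the received tensors for future requests.
    self._cache[mm_hash] = mm_item
    return mm_item
```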
_enable_ipc_cache
_enable_ipc_cache(vllm_config: VllmConfig) -> bool
Source code in vllm/multimodal/cache.py
_enable_processor_cache

_enable_processor_cache(
    model_config: ModelConfig,
    mm_registry: MultiModalRegistry,
) -> bool
Source code in vllm/multimodal/cache.py
processor_cache_from_config

processor_cache_from_config(
    vllm_config: VllmConfig, mm_registry: MultiModalRegistry
) -> Optional[BaseMultiModalProcessorCache]

Return a BaseMultiModalProcessorCache, if enabled.
Source code in vllm/multimodal/cache.py
processor_only_cache_from_config

processor_only_cache_from_config(
    model_config: ModelConfig,
    mm_registry: MultiModalRegistry,
)

Return a MultiModalProcessorOnlyCache, if enabled.
Source code in vllm/multimodal/cache.py
receiver_cache_from_config

receiver_cache_from_config(
    vllm_config: VllmConfig, mm_registry: MultiModalRegistry
) -> Optional[BaseMultiModalReceiverCache]

Return a BaseMultiModalReceiverCache, if enabled.
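A hedged sketch of wiring the factories together (assuming a populated VllmConfig in scope; which concrete cache, if any, comes back depends on whether processor caching and IPC caching are enabled):

```python
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.cache import (
    processor_cache_from_config,
    receiver_cache_from_config,
)

# vllm_config: a populated VllmConfig, assumed to exist in scope.

# Frontend process (P0):
p0_cache = processor_cache_from_config(vllm_config, MULTIMODAL_REGISTRY)

# Core process (P1):
p1_cache = receiver_cache_from_config(vllm_config, MULTIMODAL_REGISTRY)

# Either may be None when the corresponding caching mode is disabled.
```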