AttentionSpec  dataclass
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
ChunkedLocalAttentionSpec  dataclass
Bases: AttentionSpec
  
 __init__(
    block_size: int,
    num_kv_heads: int,
    head_size: int,
    dtype: dtype,
    attention_chunk_size: int,
) -> None
 
 max_memory_usage_bytes(vllm_config: VllmConfig) -> int
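Illustrative construction based on the signature above; the numeric values are examples, not defaults.

```python
import torch

from vllm.v1.kv_cache_interface import ChunkedLocalAttentionSpec

# KV cache layout for a chunked local attention layer.
# All numeric values below are illustrative.
spec = ChunkedLocalAttentionSpec(
    block_size=16,              # tokens per KV cache block
    num_kv_heads=8,
    head_size=128,
    dtype=torch.bfloat16,
    attention_chunk_size=8192,  # tokens covered by one local attention chunk
)
```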
CrossAttentionSpec  dataclass
Bases: AttentionSpec
KV cache spec for cross-attention layers in encoder-decoder models.
  
 max_memory_usage_bytes(vllm_config: VllmConfig) -> int
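Illustrative construction; the constructor is not listed on this page, so this sketch assumes the fields inherited from AttentionSpec (block_size, num_kv_heads, head_size, dtype), with example values.

```python
import torch

from vllm.v1.kv_cache_interface import CrossAttentionSpec

# Cross-attention KV cache layout for an encoder-decoder model.
# Assumes the inherited AttentionSpec fields; values are illustrative.
cross_spec = CrossAttentionSpec(
    block_size=16,
    num_kv_heads=16,
    head_size=64,
    dtype=torch.float16,
)
```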
FullAttentionSpec  dataclass
Bases: AttentionSpec
attention_chunk_size: int | None = None  class-attribute instance-attribute
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, sliding window attention is regarded as full attention in the KV cache manager (blocks are allocated for all tokens), while still being computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the sliding window size is recorded. Defaults to None when sliding window attention is not used.
 
 __init__(
    block_size: int,
    num_kv_heads: int,
    head_size: int,
    dtype: dtype,
    sliding_window: int | None = None,
    attention_chunk_size: int | None = None,
) -> None
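Illustrative construction based on the signature above, including the case described for the sliding window attribute; all values are examples, not defaults.

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec

# An ordinary full attention layer.
full_spec = FullAttentionSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
)

# With the hybrid allocator disabled, a sliding window layer can be treated
# as full attention while the window size is recorded on the spec.
sw_as_full_spec = FullAttentionSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    sliding_window=4096,  # illustrative window size
)
```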
 
 max_memory_usage_bytes(vllm_config: VllmConfig) -> int
merge  classmethod
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
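Usage sketch; it assumes merge accepts a plain list of FullAttentionSpec objects, which is not spelled out on this page, and the spec values are illustrative.

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec

# Per-layer specs that describe the same kind of full attention.
layer_specs = [
    FullAttentionSpec(
        block_size=16, num_kv_heads=8, head_size=128, dtype=torch.float16
    )
    for _ in range(2)
]

# Assumed call shape: a list of specs in, one representative spec out.
merged = FullAttentionSpec.merge(layer_specs)
assert merged.block_size == 16
```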
merge_window_sizes  classmethod
KVCacheConfig  dataclass
The KV cache configuration of a model.
kv_cache_groups: list[KVCacheGroupSpec]  instance-attribute
The KV cache groups of the model. For models with only one type of attention, there is only one group that contains all layers. For models with multiple types of attention, there will be multiple groups; see _get_kv_cache_config_uniform_page_size for more details.

kv_cache_tensors: list[KVCacheTensor]  instance-attribute
How the model runner should initialize the KV cache tensors for each layer.

num_blocks: int  instance-attribute
The number of KV cache blocks.
 
 __init__(
    num_blocks: int,
    kv_cache_tensors: list[KVCacheTensor],
    kv_cache_groups: list[KVCacheGroupSpec],
) -> None
KVCacheGroupSpec  dataclass
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
KVCacheSpec  dataclass
A base class for specifying the KV cache format of one layer.
  
 max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description | 
|---|---|
| int | The KV cache size in bytes | 
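For intuition only (this is not vLLM's internal formula for max_memory_usage_bytes), the KV footprint of a standard attention layer can be estimated from a spec's fields: K and V each hold num_kv_heads * head_size elements per token.

```python
import torch

# Back-of-envelope KV cache sizing from the fields of an attention spec.
# Illustrative values; not vLLM's internal formula.
num_kv_heads = 8
head_size = 128
block_size = 16            # tokens per KV cache block
max_model_len = 32_768     # assumed maximum sequence length
dtype = torch.float16
bytes_per_elem = torch.finfo(dtype).bits // 8

bytes_per_token = 2 * num_kv_heads * head_size * bytes_per_elem  # K and V
bytes_per_block = bytes_per_token * block_size
max_bytes_per_sequence = bytes_per_token * max_model_len

print(bytes_per_token, bytes_per_block, max_bytes_per_sequence)
```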
merge  classmethod
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
KVCacheTensor  dataclass
A class for specifying how the workers should initialize the KV cache.
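Sketch that assembles a KVCacheConfig from the pieces above. The KVCacheConfig constructor is documented earlier; the field names assumed for KVCacheTensor (size, shared_by) and KVCacheGroupSpec (layer_names, kv_cache_spec) are not listed on this page, and the layer names and sizes are illustrative.

```python
import torch

from vllm.v1.kv_cache_interface import (
    FullAttentionSpec,
    KVCacheConfig,
    KVCacheGroupSpec,
    KVCacheTensor,
)

spec = FullAttentionSpec(
    block_size=16, num_kv_heads=8, head_size=128, dtype=torch.float16
)

# Hypothetical layer names and a rough per-layer tensor size in bytes.
layer_names = ["model.layers.0.self_attn.attn", "model.layers.1.self_attn.attn"]
num_blocks = 1024
tensor_size = num_blocks * 16 * 2 * 8 * 128 * 2  # blocks * tokens * K/V * heads * head_size * bytes

kv_cache_config = KVCacheConfig(
    num_blocks=num_blocks,
    # Assumed KVCacheTensor fields: size (bytes) and shared_by (layer names).
    kv_cache_tensors=[
        KVCacheTensor(size=tensor_size, shared_by=[name]) for name in layer_names
    ],
    # Assumed KVCacheGroupSpec fields: layer_names and kv_cache_spec.
    kv_cache_groups=[KVCacheGroupSpec(layer_names=layer_names, kv_cache_spec=spec)],
)
```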
MLAAttentionSpec  dataclass
Bases: FullAttentionSpec
  
 __init__(
    block_size: int,
    num_kv_heads: int,
    head_size: int,
    dtype: dtype,
    sliding_window: int | None = None,
    attention_chunk_size: int | None = None,
    cache_dtype_str: str | None = None,
) -> None
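Illustrative construction based on the signature above. MLA caches one compressed latent per token rather than per-head K/V, so a single KV head is typical; the head size, dtype, and cache_dtype_str choices here are assumptions, not defaults.

```python
import torch

from vllm.v1.kv_cache_interface import MLAAttentionSpec

# MLA stores a single compressed KV latent per token, so one "KV head" is
# typical. Values are illustrative.
mla_spec = MLAAttentionSpec(
    block_size=16,
    num_kv_heads=1,
    head_size=576,         # e.g. compressed latent dims + rope dims (assumed)
    dtype=torch.bfloat16,
    cache_dtype_str=None,  # optionally names a quantized KV cache dtype
)
```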
MambaSpec  dataclass
Bases: KVCacheSpec
SlidingWindowSpec  dataclass
Bases: AttentionSpec
  
 __init__(
    block_size: int,
    num_kv_heads: int,
    head_size: int,
    dtype: dtype,
    sliding_window: int,
) -> None
 
 max_memory_usage_bytes(vllm_config: VllmConfig) -> int
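Illustrative construction based on the signature above; values are examples, not defaults.

```python
import torch

from vllm.v1.kv_cache_interface import SlidingWindowSpec

# KV cache layout for a sliding window attention layer: only roughly the
# last `sliding_window` tokens need to stay cached. Values are illustrative.
swa_spec = SlidingWindowSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    sliding_window=4096,
)
```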
UniformTypeKVCacheSpecs  dataclass
Bases: KVCacheSpec
A KV cache spec for multiple layers with the same type of attention. Here, "same type" means the layers always need the same number of token slots. For example, sliding window attention layers with different window sizes are not of the same type and should not be merged into one UniformTypeKVCacheSpecs.
from_specs(kv_cache_specs: dict[str, KVCacheSpec]) -> Self | None  classmethod
Return a UniformTypeKVCacheSpecs object if all layers have the same type of KV cache spec; return None if not.
is_uniform_type(kv_cache_specs: dict[str, KVCacheSpec]) -> bool  classmethod
Whether all layers have the same type of KV cache spec.
  
 max_memory_usage_bytes(vllm_config: VllmConfig) -> int
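Usage sketch for the two classmethods above; the layer names and spec values are illustrative.

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec, UniformTypeKVCacheSpecs

# Hypothetical per-layer specs, all of the same (full attention) type.
kv_cache_specs = {
    f"model.layers.{i}.self_attn.attn": FullAttentionSpec(
        block_size=16, num_kv_heads=8, head_size=128, dtype=torch.float16
    )
    for i in range(2)
}

if UniformTypeKVCacheSpecs.is_uniform_type(kv_cache_specs):
    # All layers need the same number of token slots, so they can be folded
    # into one uniform spec; from_specs returns None otherwise.
    uniform = UniformTypeKVCacheSpecs.from_specs(kv_cache_specs)
    assert uniform is not None
```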