This document explains how to configure and use the dgp.filter.ai.kvcache filter in Dubbo-go-Pixiu.
The filter integrates with vLLM (`/tokenize`) and the LMCache controller APIs (`/lookup`, `/pin`, `/compress`, `/evict`) to:

- tokenize prompts and look up existing KV cache entries,
- emit a cache-aware routing hint (`llm_preferred_endpoint_id`),
- apply cache management actions (compress / pin / evict) to hot content.
`dgp.filter.ai.kvcache` is an HTTP decode filter. A typical request flow is:

1. Parse the request body to extract `model` and `prompt` (or fall back to `messages`).
2. Record hot content (`model` + `prompt`) in the token manager.
3. Query `/lookup` for a routing hint (`llm_preferred_endpoint_id`).
4. Call `/tokenize` and `/lookup` if needed.
5. Apply cache strategy actions (compress / pin / evict).

Current cache-aware routing uses instance id matching:

- The filter writes `llm_preferred_endpoint_id` into the request context.
- `dgp.filter.llm.proxy` reads this value and tries to select an endpoint by `endpoint.id`.

So for routing to work, the LMCache lookup `instance_id` must equal the Pixiu cluster `endpoint.id`. If no match exists, the request falls back to normal load-balancing behavior.
Note: instance information can be inspected via the LMCache controller's `/query_worker_info` API.

Example configuration:

```yaml
listeners:
  - name: net/http
    protocol_type: HTTP
    address:
      socket_address:
        address: 0.0.0.0
        port: 8888
    filter_chains:
      - filters:
          - name: dgp.filter.httpconnectionmanager
            config:
              route_config:
                routes:
                  - match:
                      prefix: /
                    route:
                      cluster: vllm_cluster
              http_filters:
                - name: dgp.filter.ai.kvcache
                  config:
                    enabled: true
                    vllm_endpoint: "http://127.0.0.1:8000"
                    lmcache_endpoint: "http://127.0.0.1:9000"
                    default_model: "demo"
                    request_timeout: "2s"
                    lookup_routing_timeout: "50ms"
                    hot_window: "5m"
                    hot_max_records: 300
                    token_cache:
                      enabled: true
                      max_size: 1024
                      ttl: "10m"
                    cache_strategy:
                      enable_compression: true
                      enable_pinning: true
                      enable_eviction: true
                      load_threshold: 0.7
                      memory_threshold: 0.85
                      hot_content_threshold: 10
                      pin_instance_id: "vllm-instance-1"
                      pin_location: "LocalCPUBackend"
                      compress_instance_id: "vllm-instance-1"
                      compress_location: "LocalCPUBackend"
                      compress_method: "zstd"
                      evict_instance_id: "vllm-instance-1"
                    retry:
                      max_attempts: 3
                      base_backoff: "100ms"
                      max_backoff: "2s"
                    circuit_breaker:
                      failure_threshold: 5
                      recovery_timeout: "10s"
                      half_open_max_calls: 2
                - name: dgp.filter.llm.proxy
```
Configuration reference:

- `enabled`: enables or disables the filter.
- `vllm_endpoint`: vLLM base URL, used for `/tokenize`.
- `lmcache_endpoint`: LMCache controller base URL (`/lookup`, `/pin`, `/compress`, `/evict`).
- `lookup_routing_timeout`: time budget for the routing-path `/lookup` call.
- `token_cache`: tokenization result cache; keys are `model + "\x00" + prompt`.
- `cache_strategy.load_threshold`: load ratio in `[0,1]`.
- `cache_strategy.memory_threshold`: memory ratio in `[0,1]`, used for eviction decisions.
- `cache_strategy.hot_content_threshold`: number of hits within `hot_window` required to mark content as hot for pinning.
- `retry`: retry policy for LMCache calls (`lookup`/`pin`/`compress`/`evict`).
- `circuit_breaker`: circuit breaker for LMCache calls.

Notes:

- Filter logs use the `[kvcache]` prefix.
- `load_threshold` is treated as a ratio.
- For testing, mock `/tokenize` and the LMCache APIs to verify chain integration and routing hint behavior.
- Make sure each cluster `endpoint.id` matches the LMCache `instance_id` for targeted routing.
- Make sure `dgp.filter.ai.kvcache` is placed before `dgp.filter.llm.proxy`.

Sample: https://github.com/apache/dubbo-go-pixiu-samples (ai/kvcache)