Initial Rollout
The serverless engine learns the cost-vs-performance profile of each GPU class in your search_params from real workers running real traffic (see Choosing GPUs). How quickly it settles into the most cost-effective mix depends on how quickly workers are recruited and released, so it helps to apply a test load during the first day of operation to give the engine enough signal to converge.
Best practice is to scale up to double the expected number of required workers and then back down, repeating this cycle three separate times.
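The scale-up/scale-down cycle above can be sketched as a simple loop. Note that `update_endpoint` below is a hypothetical stand-in for whatever mechanism you use to change the endpoint's `min_workers` (CLI, SDK, or API call); only the cycle count and the doubling come from the text.

```python
import time

def warmup_rollout(update_endpoint, expected_workers, cycles=3, hold_seconds=0):
    """Scale to double the expected worker count, then back down, `cycles` times."""
    for _ in range(cycles):
        update_endpoint(min_workers=2 * expected_workers)  # scale up to 2x expected
        time.sleep(hold_seconds)                           # give workers time to recruit
        update_endpoint(min_workers=expected_workers)      # scale back down
        time.sleep(hold_seconds)

# Example with a stub that just records the requested values:
history = []
warmup_rollout(lambda min_workers: history.append(min_workers), expected_workers=4)
# history == [8, 4, 8, 4, 8, 4]
```

In practice you would set `hold_seconds` long enough for the engine to actually recruit and release workers between steps.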
Simulating Load
For examples of how to simulate load against your endpoint, see the client examples in the Vast SDK repository: https://github.com/vast-ai/vast-sdk/blob/main/examples/client/vllm_load_example.py
Managing for Bursty Load
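The linked repository shows a full vLLM client; the core pattern is simply to keep several requests in flight at once. Here is a minimal, generic sketch of that pattern, where `send_request` is whatever issues one request to your endpoint (it is injected so the sketch stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_load(send_request, total_requests=100, concurrency=8):
    """Fire `total_requests` calls with up to `concurrency` in flight at once.

    `send_request` receives the request index and performs one call to your
    endpoint (e.g. an HTTP POST). Results are returned in submission order.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send_request, i) for i in range(total_requests)]
        return [f.result() for f in futures]

# Example with a dummy request function:
results = simulate_load(lambda i: i * 2, total_requests=5, concurrency=2)
# results == [0, 2, 4, 6, 8]
```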
- Adjust min_workers: This will change the number of managed inactive workers and increase capacity for traffic peaks.
- Check max_workers: Ensure this parameter is set high enough for the serverless engine to create the necessary number of workers.
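As a concrete illustration, a bursty workload might pre-provision inactive workers while leaving plenty of headroom for peaks. The parameter names come from the text; the specific values and the plain-dict shape are illustrative only:

```python
# Illustrative endpoint settings for a bursty workload.
bursty_config = {
    "min_workers": 5,    # managed inactive workers kept ready for bursts
    "max_workers": 40,   # high ceiling so the engine can scale during peaks
}

# Sanity check: the ceiling must leave room above the warm pool.
assert bursty_config["max_workers"] > bursty_config["min_workers"]
```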
Managing for Low Demand or Idle Periods
- Adjust min_load: Reducing min_load will reduce the minimum number of active workers. Set it to 1 to reduce the count to its minimum value of 1 worker, or set it to 0 to put all workers into inactive states.
- Adjust min_workers: This will change the number of managed inactive workers.
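Mirroring the bursty example, a long idle period pulls both knobs down. Again, the parameter names come from the text while the values and the plain-dict shape are illustrative:

```python
# Illustrative endpoint settings for a low-demand or idle period.
idle_config = {
    "min_load": 0,     # allow every worker to go inactive
    "min_workers": 1,  # keep one managed inactive worker for faster wake-up
}

# With min_load at 0, no workers are forced to stay active.
assert idle_config["min_load"] == 0
```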
Scaling to Zero
To allow your endpoint to fully scale to zero during idle periods, configure inactivity_timeout alongside your other scaling parameters. The inactivity_timeout value (in seconds) determines how long the endpoint must be idle before scaling down is permitted.
- To scale to zero active workers (while keeping cold workers available): set min_load = 0 and configure a positive inactivity_timeout. Workers in the cold_workers pool will remain available for fast reactivation.
- To scale to zero total workers: set min_load = 0, cold_workers = 0, and configure a positive inactivity_timeout. This minimizes cost during extended idle periods but incurs cold-start latency when traffic resumes.
- To prevent scaling to zero regardless of other settings: set inactivity_timeout to a negative value (e.g., -1).
Setting inactivity_timeout to 0 disables inactivity-based gating entirely; the endpoint will rely solely on normal autoscaling decisions.
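The inactivity rules above can be summarized as a small decision function. This is a toy model of the gating described in the text, not the serverless engine's actual implementation:

```python
def scale_to_zero_allowed(min_load, inactivity_timeout, idle_seconds):
    """Toy model: may the endpoint scale its active workers to zero?

    Rules, as described above:
      * min_load > 0              -> no, min_load keeps workers active
      * negative inactivity_timeout -> no, scaling to zero is prevented
      * zero inactivity_timeout   -> gating disabled; the normal autoscaler decides
      * positive inactivity_timeout -> yes, once idle long enough
    """
    if min_load > 0:
        return False
    if inactivity_timeout < 0:
        return False
    if inactivity_timeout == 0:
        return True
    return idle_seconds >= inactivity_timeout

# Examples:
scale_to_zero_allowed(0, 600, 700)    # True: idle past the 600 s timeout
scale_to_zero_allowed(0, 600, 100)    # False: not idle long enough yet
scale_to_zero_allowed(0, -1, 99999)   # False: negative timeout prevents it
```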
Managing Queue Time
Use max_queue_time and target_queue_time to control how the autoscaler responds to request queuing:
- Increase max_queue_time to allow more requests to buffer on each worker before the system holds them in the global queue. This is useful for workloads with predictable, longer processing times.
- Decrease target_queue_time to trigger more aggressive scale-up when queue times rise, reducing latency at the cost of potentially higher worker counts.
- Increase target_queue_time to tolerate higher queue times before scaling up, reducing costs when some latency is acceptable.
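The target_queue_time trade-off can be seen in a toy scale signal: a lower target trips sooner (more aggressive scale-up, lower latency, more workers), while a higher target tolerates queuing and saves cost. This sketch is illustrative and not the autoscaler's real algorithm:

```python
def autoscale_signal(observed_queue_time, target_queue_time):
    """Toy signal: request scale-up when observed queue time exceeds the target."""
    return "scale_up" if observed_queue_time > target_queue_time else "hold"

# The same 12 s of queuing triggers scale-up under a tight target
# but is tolerated under a relaxed one:
autoscale_signal(12.0, target_queue_time=5.0)   # "scale_up"
autoscale_signal(12.0, target_queue_time=30.0)  # "hold"
```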