    Exercises Similarity Calculation on Vast.ai

    Alex

    Building InfiniteGrammar.de

    Once the similarity pipeline became part of the operating routine, local execution stopped being attractive.

    Similarity was no longer something I ran once in a while. It became something I wanted to run after a new generation batch, after reordering, and whenever a section started looking suspicious. At that point local execution became a friction point. Long CPU-heavy runs block the machine, get postponed, and eventually stop happening often enough.

    That is why I moved the similarity pipeline to Vast.ai.

    The goal was simple:

    Make similarity runs cheap, resumable, and operationally easy enough that running them becomes routine.

    What actually runs on Vast.ai

    vastai_similarity.py extends the local similarity pipeline with two execution modes:

    Method   | How it works                                                                     | Used for
    spacy    | TF-IDF text vectors + answer vectors + structure vectors + spaCy POS-tag vectors | Production path
    semantic | Neural embeddings from a remote Vast.ai endpoint such as BAAI/bge-m3             | Experimental

    Both methods write into the same similarity tables. The important difference is operational.

    • spacy stays on the same score range as the existing dashboards.
    • semantic produces a very different score scale and is therefore not directly compatible with the current frontend views calibrated for the TF-IDF pipeline.

    The production path is spacy: it does the job well enough for the task, and it does not need a GPU. It is mostly a CPU-heavy batch workload:

    • build text TF-IDF vectors,
    • build answer TF-IDF vectors,
    • build structure vectors,
    • build POS-tag vectors with spaCy,
    • combine them,
    • compute pairwise cosine similarity,
    • compute summaries and clustering.
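The combination step can be sketched in a few lines. This is a minimal illustration, not the actual vastai_similarity.py code: the toy sentences stand in for real exercises, the structure and POS families are omitted, and the per-family similarity matrices are blended with the weights from the example config (the real script may combine the vectors themselves instead).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for exercise texts and their answers
texts = ["Ich habe einen Hund", "Ich habe eine Katze", "Wir gehen morgen schwimmen"]
answers = ["einen", "eine", "morgen"]

# TF-IDF vectors per feature family
text_vecs = TfidfVectorizer().fit_transform(texts)
answer_vecs = TfidfVectorizer().fit_transform(answers)

# Pairwise cosine similarity per family, then a weighted blend
# (weights taken from the example config; structure/POS left out here)
text_sim = cosine_similarity(text_vecs)
answer_sim = cosine_similarity(answer_vecs)
combined = 0.50 * text_sim + 0.15 * answer_sim

# The two "Ich habe ..." sentences score higher than unrelated pairs
print(combined[0, 1] > combined[0, 2])  # True
```

The blend is what makes near-duplicates with different surface words still surface: two exercises can score high on structure and POS even when their TF-IDF text overlap is modest.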

    That can be run locally. The issue is that it became annoyingly slow. A cheap remote CPU instance proved to be a better place for this than a laptop.

    The execution model

    The pipeline is built around a state file.

    That decision matters more than the instance type.

    The state file lives under vastai_runs/:

    vastai_runs/
      └── vastai_sim_state_<run_id>.json

    The run follows the same structure regardless of where the compute happens:

    • init fetches scope and exercise data and writes the state file
    • run computes vectors / embeddings and stores results in the state file
    • finalize writes results into the database
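A state file along these lines is enough to carry all three phases. The field names here are illustrative, not the actual schema:

```json
{
  "run_id": "20240101_120000",
  "config": { "method": "spacy", "level": "A2" },
  "exercises": [
    { "id": 101, "text": "...", "answers": ["..."] }
  ],
  "completed_sections": [12, 17],
  "results": { "12": { "pairs": [[101, 102, 0.83]] } },
  "remote": { "instance_id": null, "ssh_host": null, "ssh_port": null }
}
```

init fills the top half, run appends to completed_sections and results, finalize only reads.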

    That gives the pipeline three useful properties.

    It is resumable

    If a run stops halfway through, the state file already contains completed sections. Re-running continues from where it stopped.
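The resume logic can be as simple as skipping anything already recorded in the state file. A minimal sketch, with made-up field names and the actual similarity computation elided:

```python
import json
from pathlib import Path

def run_sections(state_path: Path, sections: list[int]) -> list[int]:
    """Process only the sections not yet marked complete in the state file."""
    state = json.loads(state_path.read_text())
    done = set(state.get("completed_sections", []))
    processed = []
    for section_id in sections:
        if section_id in done:
            continue  # already computed in a previous run
        # ... compute similarity for this section here ...
        state.setdefault("completed_sections", []).append(section_id)
        # Persist after each section so a crash loses at most one section
        state_path.write_text(json.dumps(state))
        processed.append(section_id)
    return processed
```

Writing the state file after every section, not once at the end, is what makes the resumability real.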

    Remote compute does not need DB credentials

    The remote instance reads the state file and writes back to the state file. It never talks to the production database.

    Final DB writes stay local

    The compute node can disappear after the job is done. Once the updated state file is back on the local machine, finalize can write the results later.
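Decoupling the write this way means finalize is little more than "read state, upsert rows". A sketch using sqlite3 as a stand-in for the real database; the table and field names are illustrative:

```python
import json
import sqlite3
from pathlib import Path

def finalize(state_path: Path, conn: sqlite3.Connection) -> int:
    """Write similarity pairs from the state file into the database."""
    state = json.loads(state_path.read_text())
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exercise_similarity "
        "(exercise_a INTEGER, exercise_b INTEGER, score REAL, "
        "PRIMARY KEY (exercise_a, exercise_b))"
    )
    rows = [
        (a, b, score)
        for section in state.get("results", {}).values()
        for a, b, score in section.get("pairs", [])
    ]
    # Upsert so re-running finalize after a partial write is safe
    conn.executemany(
        "INSERT OR REPLACE INTO exercise_similarity VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Because the write is idempotent, finalize can be retried freely, hours or days after the compute node is gone.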

    That is the core design.

    What is needed locally before using Vast.ai

    1. Vast.ai CLI

    Install and configure the Vast.ai CLI locally. The end state should be that vastai --help works and that you are already authenticated.

    2. Python environment for the similarity script

    For the production spacy method:

    pip install spacy scikit-learn psycopg2-binary python-dotenv numpy scipy
    python -m spacy download de_core_news_sm

    3. Config file

    Example config for the recommended spacy method:

    {
      "similarity_config": {
        "grammar_section_id": null,
        "level": "A2"
      },
      "SpacySimilarity": {
        "method": "spacy",
        "skip_features": true,
        "weights": {
          "text": 0.50,
          "answers": 0.15,
          "structure": 0.15,
          "pos": 0.20
        }
      }
    }

    The production-friendly default is to scope by level or grammar_section_id when testing, then broaden to all sections once the pipeline is behaving as expected.

    Running the production spacy method on Vast.ai

    Step 1. Searching for a cheap CPU instance

    The spacy path is CPU-only. No GPU needed.

    vastai search offers 'cpu_cores>=8 num_gpus=0 dph<0.10 reliability>0.95 inet_down>200' -o 'dph'

    Filter           | Why it matters
    cpu_cores>=8     | Enough cores to make the run meaningfully faster than a laptop
    num_gpus=0       | Keeps the instance cheap
    dph<0.10         | Puts a budget ceiling on the search
    reliability>0.95 | Reduces the chance of the instance disappearing mid-run
    inet_down>200    | Makes dependency installation less painful

    Step 2. Creating the instance

    Once an OFFER_ID is picked:

    vastai create instance <OFFER_ID> --disk 25

    The base image does not need to be special. The deploy action installs the required Python packages itself.

    Step 3. Initializing the run locally

    This step fetches the exercise data from the database and writes the state file locally.

    python vastai_similarity.py --config spacy_similarity_config.json --action init

    At this point the database has already done its job. The remote node will not need DB access.

    Step 4. Deploying the run to Vast.ai

    python vastai_similarity.py --action deploy --run-id <RUN_ID> --instance-id <INSTANCE_ID>

    This action creates remote directories, uploads the scripts and the state file, installs dependencies, and starts the remote similarity job in the background.

    What gets uploaded: the similarity calculation script and the state file. What does not get uploaded: .env, database credentials. That is deliberate.

    Step 5. Monitoring the remote run

    python vastai_similarity.py --action logs --run-id <RUN_ID>

    Step 6. Finalize locally

    python vastai_similarity.py --action finalize --run-id <RUN_ID>

    This pulls the state file back from the instance via scp and writes the results to the database. That is the only step that writes to the database.

    Step 7. Destroying the instance

    vastai destroy instance <INSTANCE_ID>

    This is easy to forget. It is also the easiest way to waste money on Vast.ai.

    What the deploy action actually does

    The remote workflow is implemented directly in the script. The core sequence is:

    • ask Vast.ai for instance metadata,
    • derive SSH host and port,
    • create /root/sim and /root/sim/vastai_runs,
    • upload files with scp,
    • install Python dependencies,
    • start the job with nohup in the background,
    • persist remote metadata into the local state file.

    The remote start command:

    cd /root/sim && nohup python vastai_similarity.py --action run --run-id <RUN_ID> > job.log 2>&1 & echo "PID: $!"

    That is intentionally simple. No job queue, no orchestrator, no remote database access.
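The whole deploy sequence can be expressed as a handful of ssh/scp invocations. This sketch only builds the argument lists (nothing is executed); the host, port, and remote package list are placeholders, not values from the actual script:

```python
import shlex

def deploy_commands(host: str, port: int, run_id: str) -> list[list[str]]:
    """Build the ssh/scp command lines the deploy step runs, in order."""
    ssh = ["ssh", "-p", str(port), f"root@{host}"]
    scp = ["scp", "-P", str(port)]
    state_file = f"vastai_runs/vastai_sim_state_{run_id}.json"
    remote_job = (
        f"cd /root/sim && nohup python vastai_similarity.py "
        f"--action run --run-id {run_id} > job.log 2>&1 & echo \"PID: $!\""
    )
    return [
        # 1) create remote directories
        ssh + ["mkdir -p /root/sim/vastai_runs"],
        # 2) upload the script and the state file (never .env or DB credentials)
        scp + ["vastai_similarity.py", f"root@{host}:/root/sim/"],
        scp + [state_file, f"root@{host}:/root/sim/{state_file}"],
        # 3) install dependencies (illustrative package list)
        ssh + ["pip install spacy scikit-learn numpy scipy"],
        # 4) start the job detached from the SSH session
        ssh + [remote_job],
    ]

for cmd in deploy_commands("203.0.113.7", 40022, "demo"):
    print(shlex.join(cmd))
```

Keeping the sequence as plain commands rather than an orchestrator is what lets the compute node stay completely stateless and disposable.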

    Condensed: Setting up Vast.ai instances and running calculations

    The workflow is straightforward; here it is end to end.

    # 0) Install the Vast.ai CLI
    pip install vastai
    
    # 1) Set your API key
    # Get it from https://cloud.vast.ai/cli and paste it here
    vastai set api-key YOUR_API_KEY
    
    # 2) Upload an SSH public key for new instances
    # Simplest CLI path (pass the public key as the argument):
    vastai create ssh-key "$(cat ~/.ssh/id_ed25519.pub)"
    
    # 3) Search for an instance
    # Cheap CPU box for generic Python / ETL / spaCy / data jobs
    vastai search offers 'cpu_cores>=8 num_gpus=0 dph<0.10 reliability>0.95 inet_down>200' -o 'dph'
    
    # Or a single-GPU box for ML / embeddings / inference jobs
    vastai search offers 'gpu_ram>=16 num_gpus=1 dph<0.50 reliability>0.98 cuda_vers>=12.0 inet_down>500' -o 'dph'
    
    # 4) Rent the machine
    # Replace OFFER_ID with the first column from search results
    # --image is required; --ssh and --direct make SSH access straightforward
    vastai create instance OFFER_ID --image pytorch/pytorch --disk 30 --ssh --direct
    
    # 5) Inspect the instance
    vastai show instances
    vastai show instance INSTANCE_ID --raw

    After the instance is up, connect using the SSH command shown by Vast.ai in the console / instance details.

    ssh -p SSH_PORT root@INSTANCE_IP
    
    # 6) Copy your code to the machine
    scp -P SSH_PORT -r ./your_project root@INSTANCE_IP:/root/work
    
    # 7) Connect
    ssh -p SSH_PORT root@INSTANCE_IP
    
    # 8) Inside the instance: install deps and run your job
    cd /root/work
    pip install -r requirements.txt
    
    # foreground
    python your_script.py --arg1 value1
    
    # or background
    nohup python your_script.py --arg1 value1 > job.log 2>&1 &
    tail -f job.log

    If you need a port from the remote machine on your laptop, use SSH port forwarding:

    ssh -p SSH_PORT root@INSTANCE_IP -L 8080:localhost:8080

    When the job is finished, copy results back and terminate the instance:

    # 9) Copy results home
    scp -P SSH_PORT root@INSTANCE_IP:/root/work/output.json ./output.json
    
    # 10) Destroy the instance so billing stops
    vastai destroy instance INSTANCE_ID

    Make sure to destroy the instance when done; stopping can reduce GPU cost, but destroying is the clean way to stop storage charges too.