    Exercises Similarity Calculation on Vast.ai

    Alex

    Building InfiniteGrammar.de

    Once the similarity pipeline became part of the operating routine, local execution stopped being attractive.

    Similarity was no longer something I ran once in a while. It became something I wanted to run after a new generation batch, after reordering, and whenever a section started looking suspicious. At that point local execution became a friction point. Long CPU-heavy runs block the machine, get postponed, and eventually stop happening often enough.

    That is why I moved the similarity pipeline to Vast.ai.

    The goal was simple:

    Make similarity runs cheap, resumable, and operationally easy enough that running them becomes routine.

    What actually runs on Vast.ai

    vastai_similarity.py extends the local similarity pipeline with two execution modes:

    Method   | How it works                                                                     | Used for
    spacy    | TF-IDF text vectors + answer vectors + structure vectors + spaCy POS-tag vectors | Production path
    semantic | Neural embeddings from a remote Vast.ai endpoint such as BAAI/bge-m3             | Experimental

    Both methods write into the same similarity tables. The important difference is operational.

    • spacy stays on the same score range as the existing dashboards.
    • semantic produces a very different score scale and is therefore not directly compatible with the current frontend views calibrated for the TF-IDF pipeline.

    The production path is spacy: it does the job well enough for the task, and it does not need a GPU. It is mostly a CPU-heavy batch workload:

    • build text TF-IDF vectors,
    • build answer TF-IDF vectors,
    • build structure vectors,
    • build POS-tag vectors with spaCy,
    • combine them,
    • compute pairwise cosine similarity,
    • compute summaries and clustering.
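The combination step can be sketched in a few lines. This is a minimal illustration, not the actual vastai_similarity.py code: the toy sentences stand in for real exercises, the structure and POS families are omitted, and the per-family similarity matrices are blended with the weights from the example config (the real script may combine the vectors themselves instead).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for exercise texts and their answers
texts = ["Ich habe einen Hund", "Ich habe eine Katze", "Wir gehen morgen schwimmen"]
answers = ["einen", "eine", "morgen"]

# TF-IDF vectors per feature family
text_vecs = TfidfVectorizer().fit_transform(texts)
answer_vecs = TfidfVectorizer().fit_transform(answers)

# Pairwise cosine similarity per family, then a weighted blend
# (weights taken from the example config; structure/POS left out here)
text_sim = cosine_similarity(text_vecs)
answer_sim = cosine_similarity(answer_vecs)
combined = 0.50 * text_sim + 0.15 * answer_sim

# The two "Ich habe ..." sentences score higher than unrelated pairs
print(combined[0, 1] > combined[0, 2])  # True
```

The blend is what makes near-duplicates with different surface words still surface: two exercises can score high on structure and POS even when their TF-IDF text overlap is modest.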

    That can be run locally. The issue is that it became annoyingly slow. A cheap remote CPU instance proved to be a better place for this than a laptop.

    The execution model

    The pipeline is built around a state file.

    That decision matters more than the instance type.

    The state file lives under vastai_runs/:

    vastai_runs/
      └── vastai_sim_state_<run_id>.json

    The run follows the same structure regardless of where the compute happens:

    • init fetches scope and exercise data and writes the state file
    • run computes vectors / embeddings and stores results in the state file
    • finalize writes results into the database
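A state file along these lines is enough to carry all three phases. The field names here are illustrative, not the actual schema:

```json
{
  "run_id": "20240101_120000",
  "config": { "method": "spacy", "level": "A2" },
  "exercises": [
    { "id": 101, "text": "...", "answers": ["..."] }
  ],
  "completed_sections": [12, 17],
  "results": { "12": { "pairs": [[101, 102, 0.83]] } },
  "remote": { "instance_id": null, "ssh_host": null, "ssh_port": null }
}
```

init fills the top half, run appends to completed_sections and results, finalize only reads.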

    That gives the pipeline three useful properties.

    It is resumable

    If a run stops halfway through, the state file already contains completed sections. Re-running continues from where it stopped.
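The resume logic can be as simple as skipping anything already recorded in the state file. A minimal sketch, with made-up field names and the actual similarity computation elided:

```python
import json
from pathlib import Path

def run_sections(state_path: Path, sections: list[int]) -> list[int]:
    """Process only the sections not yet marked complete in the state file."""
    state = json.loads(state_path.read_text())
    done = set(state.get("completed_sections", []))
    processed = []
    for section_id in sections:
        if section_id in done:
            continue  # already computed in a previous run
        # ... compute similarity for this section here ...
        state.setdefault("completed_sections", []).append(section_id)
        # Persist after each section so a crash loses at most one section
        state_path.write_text(json.dumps(state))
        processed.append(section_id)
    return processed
```

Writing the state file after every section, not once at the end, is what makes the resumability real.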

    Remote compute does not need DB credentials

    The remote instance reads the state file and writes back to the state file. It never talks to the production database.

    Final DB writes stay local

    The compute node can disappear after the job is done. Once the updated state file is back on the local machine, finalize can write the results later.
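Decoupling the write this way means finalize is little more than "read state, upsert rows". A sketch using sqlite3 as a stand-in for the real database; the table and field names are illustrative:

```python
import json
import sqlite3
from pathlib import Path

def finalize(state_path: Path, conn: sqlite3.Connection) -> int:
    """Write similarity pairs from the state file into the database."""
    state = json.loads(state_path.read_text())
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exercise_similarity "
        "(exercise_a INTEGER, exercise_b INTEGER, score REAL, "
        "PRIMARY KEY (exercise_a, exercise_b))"
    )
    rows = [
        (a, b, score)
        for section in state.get("results", {}).values()
        for a, b, score in section.get("pairs", [])
    ]
    # Upsert so re-running finalize after a partial write is safe
    conn.executemany(
        "INSERT OR REPLACE INTO exercise_similarity VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Because the write is idempotent, finalize can be retried freely, hours or days after the compute node is gone.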

    That is the core design.

    What is needed locally before using Vast.ai

    1. Vast.ai CLI

    Install and configure the Vast.ai CLI locally. The end state should be that vastai --help works and that you are already authenticated.

    2. Python environment for the similarity script

    For the production spacy method:

    pip install spacy scikit-learn psycopg2-binary python-dotenv numpy scipy
    python -m spacy download de_core_news_sm

    3. Config file

    Example config for the recommended spacy method:

    {
      "similarity_config": {
        "grammar_section_id": null,
        "level": "A2"
      },
      "SpacySimilarity": {
        "method": "spacy",
        "skip_features": true,
        "weights": {
          "text": 0.50,
          "answers": 0.15,
          "structure": 0.15,
          "pos": 0.20
        }
      }
    }

    The production-friendly default is to scope by level or grammar_section_id when testing, then broaden to all sections once the pipeline is behaving as expected.

    Running the production spacy method on Vast.ai

    Step 1. Searching for a cheap CPU instance

    The spacy path is CPU-only. No GPU needed.

    vastai search offers 'cpu_cores>=8 num_gpus=0 dph<0.10 reliability>0.95 inet_down>200' -o 'dph'

    Filter           | Why it matters
    cpu_cores>=8     | Enough cores to make the run meaningfully faster than a laptop
    num_gpus=0       | Keeps the instance cheap
    dph<0.10         | Puts a budget ceiling on the search
    reliability>0.95 | Reduces the chance of the instance disappearing mid-run
    inet_down>200    | Makes dependency installation less painful

    Step 2. Creating the instance

    Once an OFFER_ID is picked:

    vastai create instance <OFFER_ID> --disk 25

    The base image does not need to be special. The deploy action installs the required Python packages itself.

    Step 3. Initializing the run locally

    This step fetches the exercise data from the database and writes the state file locally.

    python vastai_similarity.py --config spacy_similarity_config.json --action init

    At this point the database has already done its job. The remote node will not need DB access.

    Step 4. Deploying the run to Vast.ai

    python vastai_similarity.py --action deploy --run-id <RUN_ID> --instance-id <INSTANCE_ID>

    This action creates remote directories, uploads the scripts and the state file, installs dependencies, and starts the remote similarity job in the background.

    What gets uploaded: the similarity calculation script and the state file. What does not get uploaded: .env, database credentials. That is deliberate.

    Step 5. Monitoring the remote run

    python vastai_similarity.py --action logs --run-id <RUN_ID>

    Step 6. Finalize locally

    python vastai_similarity.py --action finalize --run-id <RUN_ID>

    This pulls the state file back from the instance via scp and writes the results to the database. That is the only step that writes to the database.

    Step 7. Destroying the instance

    vastai destroy instance <INSTANCE_ID>

    This is easy to forget. It is also the easiest way to waste money on Vast.ai.

    What the deploy action actually does

    The remote workflow is implemented directly in the script. The core sequence is:

    • ask Vast.ai for instance metadata,
    • derive SSH host and port,
    • create /root/sim and /root/sim/vastai_runs,
    • upload files with scp,
    • install Python dependencies,
    • start the job with nohup in the background,
    • persist remote metadata into the local state file.

    The remote start command:

    cd /root/sim && nohup python vastai_similarity.py --action run --run-id <RUN_ID> > job.log 2>&1 & echo "PID: $!"

    That is intentionally simple. No job queue, no orchestrator, no remote database access.
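The whole deploy sequence can be expressed as a handful of ssh/scp invocations. This sketch only builds the argument lists (nothing is executed); the host, port, and remote package list are placeholders, not values from the actual script:

```python
import shlex

def deploy_commands(host: str, port: int, run_id: str) -> list[list[str]]:
    """Build the ssh/scp command lines the deploy step runs, in order."""
    ssh = ["ssh", "-p", str(port), f"root@{host}"]
    scp = ["scp", "-P", str(port)]
    state_file = f"vastai_runs/vastai_sim_state_{run_id}.json"
    remote_job = (
        f"cd /root/sim && nohup python vastai_similarity.py "
        f"--action run --run-id {run_id} > job.log 2>&1 & echo \"PID: $!\""
    )
    return [
        # 1) create remote directories
        ssh + ["mkdir -p /root/sim/vastai_runs"],
        # 2) upload the script and the state file (never .env or DB credentials)
        scp + ["vastai_similarity.py", f"root@{host}:/root/sim/"],
        scp + [state_file, f"root@{host}:/root/sim/{state_file}"],
        # 3) install dependencies (illustrative package list)
        ssh + ["pip install spacy scikit-learn numpy scipy"],
        # 4) start the job detached from the SSH session
        ssh + [remote_job],
    ]

for cmd in deploy_commands("203.0.113.7", 40022, "demo"):
    print(shlex.join(cmd))
```

Keeping the sequence as plain commands rather than an orchestrator is what lets the compute node stay completely stateless and disposable.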

    Condensed: Setting up Vast.ai instances and running calculations

    The workflow is straightforward; here it is end to end.

    # 0) Install the Vast.ai CLI
    pip install vastai
    
    # 1) Set your API key
    # Get it from https://cloud.vast.ai/cli and paste it here
    vastai set api-key YOUR_API_KEY
    
    # 2) Upload an SSH public key for new instances
    # Simplest CLI path (pass the public key as the argument):
    vastai create ssh-key "$(cat ~/.ssh/id_ed25519.pub)"
    
    # 3) Search for an instance
    # Cheap CPU box for generic Python / ETL / spaCy / data jobs
    vastai search offers 'cpu_cores>=8 num_gpus=0 dph<0.10 reliability>0.95 inet_down>200' -o 'dph'
    
    # Or a single-GPU box for ML / embeddings / inference jobs
    vastai search offers 'gpu_ram>=16 num_gpus=1 dph<0.50 reliability>0.98 cuda_vers>=12.0 inet_down>500' -o 'dph'
    
    # 4) Rent the machine
    # Replace OFFER_ID with the first column from search results
    # --image is required; --ssh and --direct make SSH access straightforward
    vastai create instance OFFER_ID --image pytorch/pytorch --disk 30 --ssh --direct
    
    # 5) Inspect the instance
    vastai show instances
    vastai show instance INSTANCE_ID --raw

    After the instance is up, connect using the SSH command shown by Vast.ai in the console / instance details.

    ssh -p SSH_PORT root@INSTANCE_IP
    
    # 6) Copy your code to the machine
    scp -P SSH_PORT -r ./your_project root@INSTANCE_IP:/root/work
    
    # 7) Connect
    ssh -p SSH_PORT root@INSTANCE_IP
    
    # 8) Inside the instance: install deps and run your job
    cd /root/work
    pip install -r requirements.txt
    
    # foreground
    python your_script.py --arg1 value1
    
    # or background
    nohup python your_script.py --arg1 value1 > job.log 2>&1 &
    tail -f job.log

    If you need a port from the remote machine on your laptop, use SSH port forwarding:

    ssh -p SSH_PORT root@INSTANCE_IP -L 8080:localhost:8080

    When the job is finished, copy results back and terminate the instance:

    # 9) Copy results home
    scp -P SSH_PORT root@INSTANCE_IP:/root/work/output.json ./output.json
    
    # 10) Destroy the instance so billing stops
    vastai destroy instance INSTANCE_ID

    Make sure to destroy the instance when done; stopping can reduce GPU cost, but destroying is the clean way to stop storage charges too.