Legal professionals spend 20 to 40% of their time searching for information. Contract review requires cross-referencing clauses across dozens of agreements. Compliance teams must verify that policies align with regulatory requirements, often across hundreds of pages of regulation. Due diligence involves reading rooms full of documents to find specific provisions.
Keyword search fails legal work because legal language is deliberately precise but wildly inconsistent across documents. One contract says "indemnification," another says "hold harmless," a third says "defense and indemnity," all meaning approximately the same thing. A search for any one term misses the others. Semantic search matches on meaning rather than exact wording, so a single query can retrieve all three.
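To make the keyword problem concrete, here is a minimal sketch of literal substring matching over three clauses. The clause sentences and document ids are invented for illustration; only the three phrases come from the paragraph above.

```python
# Three clauses that express roughly the same obligation in different words.
# The sentences and ids are illustrative, not taken from the sample documents.
clauses = {
    "contract_a": "Supplier's indemnification obligations survive termination.",
    "contract_b": "Supplier shall hold harmless the Customer from all claims.",
    "contract_c": "Supplier owes a duty of defense and indemnity to Customer.",
}


def keyword_search(term: str, docs: dict[str, str]) -> list[str]:
    """Return the ids of documents whose text contains the literal term."""
    needle = term.lower()
    return [doc_id for doc_id, text in docs.items() if needle in text.lower()]


# Each phrasing finds only the one contract that happens to use it.
print(keyword_search("indemnification", clauses))
print(keyword_search("hold harmless", clauses))
```

Each search returns a single contract, even though all three clauses are relevant to the same question.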
vai turns your contract library into a searchable knowledge base in minutes. Point it at a folder of legal documents, and it handles chunking, embedding with Voyage AI's legal-domain model, and indexing in MongoDB Atlas Vector Search. The result is semantic search: a query such as "What are our data deletion obligations?" finds answers across your GDPR summary, CCPA policy, privacy policy, and data processing addendum, even when each uses different terminology.
Documents (your files) → Chunk (split text) → Embed (Voyage AI) → Index (MongoDB Atlas) → Search (semantic query)
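The Chunk stage of the pipeline above can be pictured with a small sketch. This is not vai's actual chunking strategy, which this page does not describe. It is one simple approach, assuming paragraph boundaries are blank lines and a hypothetical `max_chars` budget per chunk.

```python
def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than split,
    so no clause is cut in the middle.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still within budget, or this is the first paragraph of a chunk.
            current = candidate
        else:
            # Adding the paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps each clause intact, which matters for legal text, where a half-sentence can reverse the meaning of a provision.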
15 synthetic but realistic documents, ~39KB total. Small enough to process in minutes, rich enough to produce meaningful search results.
| File | Topic | Size |
|---|---|---|
| master-services-agreement.md | MSA template: scope, payment terms, IP provisions | ~4KB |
| saas-subscription-agreement.md | SaaS terms: uptime SLA, data handling, renewal/termination | ~3KB |
| data-processing-addendum.md | DPA with GDPR and CCPA provisions, sub-processor obligations | ~3KB |
| nda-mutual.md | Mutual NDA: definition of confidential info, exclusions, term | ~2KB |
| nda-unilateral.md | One-way NDA: receiving party obligations, return/destruction | ~2KB |
| employment-agreement.md | Employment terms: compensation, benefits, non-compete | ~3KB |
| independent-contractor.md | Contractor agreement: deliverables, IP assignment, indemnification | ~3KB |
| privacy-policy.md | Company privacy policy: data collection, retention, user rights | ~3KB |
| acceptable-use-policy.md | AUP for SaaS product: prohibited uses, enforcement, liability caps | ~2KB |
| ip-assignment-agreement.md | IP assignment: work product, prior inventions, moral rights | ~2KB |
| gdpr-compliance-summary.md | GDPR requirements: lawful basis, data subject rights, DPO | ~3KB |
| ccpa-compliance-summary.md | CCPA requirements: consumer rights, opt-out, service providers | ~3KB |
| soc2-policy-overview.md | SOC 2 Trust Services Criteria: security, availability, confidentiality | ~2KB |
| limitation-of-liability.md | Analysis of liability cap patterns across contract types | ~2KB |
| force-majeure-clauses.md | Force majeure provisions: triggering events, notice, remedies | ~2KB |
From zero to a searchable knowledge base. Follow these steps; each takes 1 to 3 minutes.
Install vai
Install the vai CLI globally. If you already have it, skip to the next step.
$ npm install -g voyageai-cli
added 1 package in 3s
1 package is looking for funding
run `npm fund` for details
Configure credentials
Set your Voyage AI API key and MongoDB Atlas connection string. You can get a free Voyage AI key at dash.voyageai.com and a free MongoDB Atlas cluster at cloud.mongodb.com.
✓ api-key saved
✓ mongodb-uri saved
Your credentials are stored locally in ~/.vai/config.json and never shared.
Download the sample documents
Grab the 15-file sample legal document set. These are synthetic but realistic contracts, policies, and regulatory summaries covering a fictional company's legal library.
Archive: sample-docs.zip
inflating: ./sample-docs/master-services-agreement.md
inflating: ./sample-docs/saas-subscription-agreement.md
...
inflating: ./sample-docs/force-majeure-clauses.md
15 files extracted
Ingest and embed the documents
Run the vai pipeline to chunk, embed, and index all 15 documents. This uses voyage-law-2, a model specifically trained on legal text, and creates a vector search index in MongoDB Atlas.
◼ Scanning ./sample-docs/ ...
Found 15 files (39KB total)
◼ Chunking documents ...
Created 142 chunks (avg 274 chars)
◼ Embedding with voyage-law-2 ...
████████████████████████████████ 142/142 chunks
Embedded in 2.8s (51 chunks/sec)
◼ Storing in MongoDB Atlas ...
Database: legal_demo
Collection: legal_knowledge
Inserted 142 documents
◼ Creating vector search index ...
Index "vector_index" created on field "embedding"
Dimensions: 1024 | Similarity: cosine
✓ Pipeline complete — 15 files → 142 indexed chunks
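The index above is created with cosine similarity. As a rough illustration of what that ranking means, the sketch below scores a query vector against stored chunk vectors in plain Python. The 3-dimensional vectors and chunk ids are toy values; real voyage-law-2 vectors have 1024 dimensions, and Atlas performs this search server-side with an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, highest similarity first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors: the first chunk points in nearly the same direction as the query.
toy_chunks = {
    "erasure-clause": [0.9, 0.1, 0.0],
    "payment-terms": [0.0, 0.2, 0.9],
}
print(rank_chunks([1.0, 0.0, 0.0], toy_chunks, top_k=2))
```

Because cosine similarity depends only on direction, two chunks of very different length can still score as near neighbours when they discuss the same concept.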
Run your first search
Test the knowledge base with a query that spans multiple documents. Notice how the legal-domain model finds relevant clauses even when the terminology differs.
Query: "What are our obligations if a customer requests deletion of their data?"
Model: voyage-law-2 | Results: 5
1. gdpr-compliance-summary.md (score: 0.95)
"Right to Erasure (Article 17): Data subjects have the right to
obtain from the controller the erasure of personal data without
undue delay. The controller shall erase personal data within
30 days of receiving a verified request..."
2. ccpa-compliance-summary.md (score: 0.91)
"Right to Delete (Section 1798.105): A consumer shall have the
right to request that a business delete any personal information
about the consumer which the business has collected..."
3. data-processing-addendum.md (score: 0.88)
"Data Deletion: Upon termination of the Agreement or upon
Controller's written request, Processor shall delete all
Personal Data processed on behalf of the Controller..."
Try cross-document queries
Run queries that require understanding legal concepts across different document types. This is where semantic search shines over keyword search.
Query: "Compare the indemnification provisions across our contracts"
Model: voyage-law-2 | Results: 5
1. independent-contractor.md (score: 0.93)
"Indemnification: Contractor shall indemnify, defend, and hold
harmless Company from any claims, damages, or expenses arising
from Contractor's breach of this Agreement or negligence..."
2. master-services-agreement.md (score: 0.90)
"Mutual Indemnification: Each party shall indemnify the other
against third-party claims arising from (a) breach of
representations, (b) willful misconduct, or (c) violation
of applicable law..."
3. saas-subscription-agreement.md (score: 0.85)
"Provider Indemnification: Provider shall defend Customer against
any claim that the Service infringes a third party's intellectual
property rights..."
Explore in the playground
Launch the vai playground for a visual interface. Browse your indexed legal documents, run queries interactively, and compare how different models handle legal terminology.
◼ Starting vai playground ...
Server running at http://localhost:1958
Open your browser to explore:
• Search your knowledge base
• Compare embedding models
• Visualize similarity scores
Try comparing voyage-law-2 results with voyage-4-large on the same legal query to see how the domain-specific model captures legal semantics.
See how semantic search handles real questions.
“What are our obligations if a customer requests deletion of their data?”
Spans four documents: GDPR summary, CCPA summary, privacy policy, and DPA. Tests cross-document retrieval on the same legal concept expressed differently in each.
gdpr-compliance-summary.md
“Right to Erasure (Article 17): Data subjects have the right to obtain from the controller the erasure of personal data without undue delay. The controller shall erase personal data within 30 days of receiving a verified request.”
ccpa-compliance-summary.md
“Right to Delete (Section 1798.105): A consumer shall have the right to request that a business delete any personal information about the consumer which the business has collected.”
data-processing-addendum.md
“Upon termination of the Agreement or upon Controller's written request, Processor shall delete all Personal Data processed on behalf of the Controller within 30 calendar days.”
“Compare the indemnification provisions across our contracts”
Tests retrieval across MSA, contractor agreement, and SaaS agreement. Each uses slightly different indemnification language ("hold harmless," "defend and indemnify," "mutual indemnification").
“What happens if we cannot meet the SLA due to a natural disaster?”
Tests the intersection of force majeure provisions and SLA commitments across the SaaS agreement and force majeure clauses document.
“Do our NDAs allow sharing confidential information with sub-processors?”
Tests NDA exception clauses against DPA sub-processor provisions. A nuanced legal question requiring cross-document reasoning.
“What non-compete restrictions apply to former employees?”
Tests precise retrieval from the employment agreement, specifically the restrictive covenants section.
| Model | Relevance | Notes |
|---|---|---|
| voyage-law-2 (recommended) | 95% | Purpose-built for legal text. Best at distinguishing legal terms that have different meanings in general English ("consideration," "party," "instrument"). |
| voyage-4-large | 87% | Strong general-purpose model. Handles straightforward legal queries well, but misses nuance in cross-referencing clauses and legal term disambiguation. |
| voyage-4-lite | 78% | Fast and cost-effective. Adequate for simple keyword-like queries, but struggles with the semantic precision legal search demands. |
For legal documents, voyage-law-2 consistently outperforms general-purpose models on queries that require understanding legal-specific semantics. The difference is most pronounced on queries like "Compare indemnification provisions" where the model needs to recognize that "hold harmless," "defend and indemnify," and "mutual indemnification" all refer to the same legal concept. For simple factual retrieval, the gap narrows, but the domain model is the clear choice for any serious legal search application.
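This page does not say how the relevance percentages were measured. If you want to compare models on your own corpus, one common option is precision at k: the fraction of the top k results that a reviewer has marked relevant. A minimal sketch, with hypothetical document ids and judgments:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the first k retrieved ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Hypothetical run: a reviewer judged four documents relevant to the query,
# and the model returned three of them in its top five.
retrieved_ids = ["gdpr-summary", "ccpa-summary", "dpa", "aup", "msa"]
relevant_ids = {"gdpr-summary", "ccpa-summary", "dpa", "privacy-policy"}
print(precision_at_k(retrieved_ids, relevant_ids, k=5))
```

Averaging this score over a fixed set of queries gives one number per model, which makes side-by-side comparisons repeatable.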
You just built a working knowledge base from 15 sample docs. Here is what changes when you scale to thousands of real documents.
Privilege and confidentiality
Documents stay in your MongoDB Atlas cluster. Text is sent to Voyage AI for embedding (see their data handling policy). The resulting vectors do not contain readable text, but the stored chunks in MongoDB do. Plan your access controls accordingly.
Contract volume
A mid-size company might have 500 to 5,000 contracts. At this scale, initial embedding costs are modest with voyage-law-2, and queries cost fractions of a cent. Use vai estimate to project costs for your corpus size.
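For a back-of-the-envelope figure before running vai estimate, the arithmetic can be sketched as below. This is not how vai estimate works internally. The ratio of 4 characters per token is a rough rule of thumb, and the price is a placeholder parameter, not Voyage AI's actual rate, so check current pricing before relying on the result.

```python
def estimate_embedding_cost(
    total_chars: int,
    usd_per_million_tokens: float,
    chars_per_token: float = 4.0,
) -> float:
    """Rough one-time embedding cost in USD for a corpus of total_chars characters."""
    tokens = total_chars / chars_per_token
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical corpus: 5,000 contracts averaging 40,000 characters each,
# priced at a placeholder rate of $0.10 per million tokens.
corpus_chars = 5_000 * 40_000
cost = estimate_embedding_cost(corpus_chars, usd_per_million_tokens=0.10)
print(f"${cost:.2f}")
```

The estimate covers the initial embedding pass only. Each query embeds just a short question, which is why per-query cost is a small fraction of that.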
Metadata filtering
Legal search often needs filters by contract type, counterparty, or date range. vai supports metadata filters on search, so you can narrow results to "only NDAs signed in the last 2 years" before semantic ranking applies.
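At the database level, this kind of pre-filtering corresponds to the `filter` option of the MongoDB Atlas `$vectorSearch` aggregation stage. The sketch below builds such a pipeline as plain Python data. The index name and vector field come from the pipeline output shown earlier. The metadata fields `contract_type` and `signed_date`, and the projected fields `text` and `source`, are hypothetical, and whether vai issues exactly this query is an assumption.

```python
from datetime import datetime, timedelta, timezone


def build_vector_search_pipeline(
    query_vector: list[float],
    contract_type: str,
    signed_after: datetime,
    limit: int = 5,
) -> list[dict]:
    """Build an aggregation pipeline that filters by metadata, then ranks by similarity."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",       # index name from the pipeline output
                "path": "embedding",           # vector field from the pipeline output
                "queryVector": query_vector,
                "numCandidates": limit * 20,   # candidates considered before ranking
                "limit": limit,
                "filter": {
                    "$and": [
                        {"contract_type": {"$eq": contract_type}},  # hypothetical field
                        {"signed_date": {"$gte": signed_after}},    # hypothetical field
                    ]
                },
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,      # hypothetical field holding the chunk text
                "source": 1,    # hypothetical field holding the file name
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# "Only NDAs signed in the last 2 years". The zero vector is a stand-in for a
# real query embedding produced by the same model used at indexing time.
two_years_ago = datetime.now(timezone.utc) - timedelta(days=730)
pipeline = build_vector_search_pipeline([0.0] * 1024, "nda", two_years_ago)
```

With a driver such as pymongo, the pipeline would be passed to `collection.aggregate(pipeline)`. Fields used in `filter` must also be declared as filter fields in the Atlas vector index definition.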
Keeping documents current
Contracts get amended, policies get updated. Re-run vai pipeline on updated files and it will re-chunk, re-embed, and update only the changed documents. Automate this as part of your document management workflow.
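If you automate re-ingestion, it helps to know which files changed before calling the pipeline. How vai detects changes is not described here. The sketch below shows one common technique, comparing content hashes against a stored manifest, with hypothetical file names and contents.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable SHA-256 fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_changed(current_docs: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """Return names of documents that are new or whose content differs from the manifest.

    manifest maps file name to the hash recorded at the last ingestion.
    """
    return sorted(
        name
        for name, text in current_docs.items()
        if manifest.get(name) != content_hash(text)
    )


# Hypothetical state: one unchanged file, one amended file, one new file.
manifest = {
    "nda-mutual.md": content_hash("Term: 2 years."),
    "privacy-policy.md": content_hash("Retention: 12 months."),
}
current = {
    "nda-mutual.md": "Term: 2 years.",
    "privacy-policy.md": "Retention: 24 months.",
    "msa-amendment.md": "Amendment No. 1.",
}
print(find_changed(current, manifest))
```

Only the amended and new files are reported, so a scheduled job can skip re-embedding everything else.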
Conversational interface
The natural next step is vai chat: a compliance officer asking "Do we have any contracts expiring in the next 90 days with auto-renewal clauses?" and getting answers grounded in actual contract text.
Install vai and go from documents to searchable knowledge in minutes.
$ npm install -g voyageai-cli