An engineer's take on how to build CLIP™ — cloud resources, service boundaries, data flows and the technology stack for both engines.
CLIP™ is a three-tier, two-engine platform. The tiers map to business capability; the engines map to runtime workloads. Both engines share a common data layer and a unified API gateway.
Business context: Sales planners currently perform a 6-step manual workflow every quarter: re-reading contracts, matching titles, confirming rights (across title × territory × language × media type × license type × window), calculating fees, and exploring near-miss titles ad hoc. CLIP fully automates steps 1–4 and adds systematic intelligence to step 5 (Make-It-Eligible). The architecture below is designed to evaluate every combination of those six dimensions in real time.
Module 1 and Module 2 are independently deployable services. They share a data layer but have zero runtime coupling — Engine 2 consumes Engine 1's output as JSON, not via function calls.
Rules are data, not code. A generic recursive evaluator handles every contract. New attributes require zero code changes — only master-data additions.
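To make the "rules are data" idea concrete, here is a minimal sketch of a generic recursive evaluator. The node shape (`all`, `any`, `not`, `leaf`), the operators, and the attribute names are assumptions made for illustration; the actual rules.json schema is not shown in this document.

```typescript
// Hypothetical rule-tree shape; CLIP's real rules.json schema is not shown in this document.
type Scalar = string | number | boolean;

type RuleNode =
  | { kind: "all"; children: RuleNode[] } // every child must pass
  | { kind: "any"; children: RuleNode[] } // at least one child must pass
  | { kind: "not"; child: RuleNode }
  | { kind: "leaf"; attribute: string; op: "eq" | "in" | "gte" | "lte"; value: Scalar | Scalar[] };

type TitleRecord = { [attribute: string]: Scalar | undefined };

function evaluateRule(rule: RuleNode, title: TitleRecord): boolean {
  if (rule.kind === "all") return rule.children.every((c) => evaluateRule(c, title));
  if (rule.kind === "any") return rule.children.some((c) => evaluateRule(c, title));
  if (rule.kind === "not") return !evaluateRule(rule.child, title);

  // Leaf: compare one catalog attribute against the value stored in the rule.
  const actual = title[rule.attribute];
  const expected = rule.value;
  if (actual === undefined) return false; // a missing attribute never satisfies a leaf
  if (rule.op === "eq") return actual === expected;
  if (rule.op === "in") return Array.isArray(expected) && expected.some((v) => v === actual);
  if (typeof actual !== "number" || typeof expected !== "number") return false;
  return rule.op === "gte" ? actual >= expected : actual <= expected;
}

// Illustrative qualifier: in US or CA, released 2020 or later, and not a short.
const qualifier: RuleNode = {
  kind: "all",
  children: [
    { kind: "leaf", attribute: "territory", op: "in", value: ["US", "CA"] },
    { kind: "leaf", attribute: "releaseYear", op: "gte", value: 2020 },
    { kind: "not", child: { kind: "leaf", attribute: "mediaType", op: "eq", value: "short" } },
  ],
};

const eligible = evaluateRule(qualifier, { territory: "US", releaseYear: 2022, mediaType: "feature" });
const tooOld = evaluateRule(qualifier, { territory: "US", releaseYear: 2015, mediaType: "feature" });
```

Because a leaf looks its attribute up by name, a rule over a new attribute (say, a hypothetical `screens` field) only needs that field to exist in the title record; the evaluator itself stays unchanged.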
Engine 1 emits contract:rules-confirmed. Engine 2 subscribes and re-runs eligibility. No polling. No tight coupling. New consumers subscribe without engine changes.
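The publish/subscribe relationship can be sketched with an in-memory topic. This is only a stand-in for a managed event bus; the `Topic` class, the payload fields, and the audit consumer are hypothetical.

```typescript
// In-memory stand-in for the managed event bus; names and payload shape are hypothetical.
type RulesConfirmed = { contractId: string; rulesVersion: number };
type Subscriber<T> = (payload: T) => void;

class Topic<T> {
  private subscribers: Subscriber<T>[] = [];

  subscribe(fn: Subscriber<T>): void {
    this.subscribers.push(fn);
  }

  publish(payload: T): void {
    // Fan out to every subscriber; the publisher never names its consumers.
    for (const fn of this.subscribers) fn(payload);
  }
}

const rulesConfirmed = new Topic<RulesConfirmed>(); // stands for contract:rules-confirmed
const activity: string[] = [];

// Engine 2 reacts to the event by re-running eligibility.
rulesConfirmed.subscribe((e) => {
  activity.push(`engine2: re-run eligibility for ${e.contractId}`);
});
// A later consumer (say, an audit log) subscribes without touching either engine.
rulesConfirmed.subscribe((e) => {
  activity.push(`audit: rules v${e.rulesVersion} confirmed`);
});

// Engine 1 publishes once the planner confirms the rules.
rulesConfirmed.publish({ contractId: "C-001", rulesVersion: 3 });
```

The publisher only knows the topic, so adding the second subscriber required no change to the publishing side.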
Engine 2 publishes the attribute vocabulary that Engine 1's LLM must use. The feedback loop prevents hallucinated field names.
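One possible enforcement point for this feedback loop is a post-extraction check that rejects any leaf whose field name is not in the published vocabulary. The sketch below assumes a flat list of extracted leaves and a made-up vocabulary; neither shape is specified in this document.

```typescript
// Hypothetical vocabulary; in CLIP this list would be published by Engine 2.
const attributeVocabulary = new Set<string>(["territory", "language", "mediaType", "licenseType"]);

type ExtractedLeaf = { attribute: string; op: string; value: unknown };

function groundLeaves(
  leaves: ExtractedLeaf[],
  vocabulary: Set<string>
): { accepted: ExtractedLeaf[]; rejected: ExtractedLeaf[] } {
  const accepted: ExtractedLeaf[] = [];
  const rejected: ExtractedLeaf[] = [];
  for (const leaf of leaves) {
    // Unknown field names are set aside for review instead of being kept silently.
    if (vocabulary.has(leaf.attribute)) {
      accepted.push(leaf);
    } else {
      rejected.push(leaf);
    }
  }
  return { accepted, rejected };
}

const grounded = groundLeaves(
  [
    { attribute: "territory", op: "in", value: ["US"] },
    { attribute: "theatricalWindowDays", op: "gte", value: 45 }, // not in the vocabulary
  ],
  attributeVocabulary
);
```

Rejected leaves could be sent back to the LLM with the vocabulary attached, or surfaced to the planner during review; which of these CLIP does is not stated here.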
Module 1 transforms an 80–200 page PDF into structured, executable rules across four segments: Qualifiers, Start Date Window, Caps, and Rate Cards.
1. PDF → S3 → SQS message
2. OCR · layout · chunk into semantic sections
3. Chunks → vector store for RAG retrieval
4. Planner identifies term years, categories, segment boundaries
5. 5 specialist LLMs run in parallel (Q · W · C · R · Amend)
6. Grounded against catalog attribute vocabulary
7. 5-step UI wizard · planner confirms each segment

| Component | Technology | Why This Choice |
|---|---|---|
| PDF Parser | AWS Textract + PyMuPDF fallback | Textract handles scanned contracts with tables/images; PyMuPDF for native-text PDFs. Both produce structured text + bounding boxes. |
| Embedding Model | OpenAI Ada-002 or Cohere Embed v3 | 1536-dim vectors (Ada-002) or 1024-dim vectors (Embed v3); best cost/quality for contract-length chunks (512–1024 tokens). |
| Vector Store | PGVector (PostgreSQL ext) or Pinecone | PGVector for simplicity (single DB); Pinecone for scale if >50 concurrent contracts. |
| Agent Framework | LangGraph (LangChain) or AutoGen | Multi-step planning with tool use; LangGraph gives deterministic state machines for the planner-retriever-reader loop. |
| LLM Provider | GPT-4o (primary), Claude 3.5 Sonnet (fallback) | GPT-4o for multi-modal (tables + images); Claude for long-context (200-page single-pass fallback). |
| Job Queue | AWS SQS with DLQ | Managed, serverless, 3-retry with exponential backoff; DLQ for failed extractions. |
| Compute | ECS Fargate (spot for extraction workers) | No GPU needed for embedding. A GPU would only be needed for self-hosted LLM fine-tunes, which would require EC2-backed capacity because Fargate does not offer GPU tasks. |
| Database | PostgreSQL 16 (RDS) with JSONB | JSONB stores the rule trees natively; relational for contracts, terms, categories, audit log. |
| Frontend | Angular 19 + Vite or Next.js 15 (React 19) | Angular is SPE standard; Next.js offers SSR, React Server Components, and a wider ecosystem. Both support lazy loading + tree-shaking. Component library reusable across Module 1 and Module 2 UIs. |
| Auth | Okta (OIDC + RBAC) | SPE standard SSO; role-based access for planner vs manager vs admin. |
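
The Job Queue row's "3-retry with exponential backoff" can be sketched as a small decision function. The 30-second base delay and the function names are assumptions. With SQS, a consumer would typically apply the delay by changing the message's visibility timeout, and the dead-letter hand-off would be configured through the queue's redrive policy; neither call is shown here.

```typescript
// Assumed values: a 30 s base delay and three retries before dead-lettering.
const MAX_RETRIES = 3;
const SQS_MAX_VISIBILITY_SECONDS = 43200; // SQS caps the visibility timeout at 12 hours

function backoffSeconds(failedAttempts: number, baseSeconds: number = 30): number {
  // Doubles per failed attempt: 30, 60, 120, ...
  const delay = baseSeconds * Math.pow(2, failedAttempts - 1);
  return Math.min(delay, SQS_MAX_VISIBILITY_SECONDS);
}

type RetryDecision =
  | { action: "retry"; visibilityTimeoutSeconds: number }
  | { action: "dead-letter" };

// Decide what to do after the Nth failed extraction attempt for one message.
function decideAfterFailure(failedAttempts: number): RetryDecision {
  if (failedAttempts > MAX_RETRIES) return { action: "dead-letter" };
  return { action: "retry", visibilityTimeoutSeconds: backoffSeconds(failedAttempts) };
}

// Delays before retries 1, 2 and 3.
const retrySchedule = [1, 2, 3].map((n) => backoffSeconds(n));
```

Under these assumptions a message is retried after 30, 60 and 120 seconds, and a fourth failure sends it to the DLQ.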
Module 2 takes Engine 1's structured rules + the live title catalog and produces per-title, per-term-year eligibility verdicts, revenue projections, and Make-It-Eligible playbooks.
1. Rules JSON + Catalog JSON (or API)
2. Title × Category × Term-Year sweep
3. Q · W · C · R per (title, category, term)
4. Eligible · Conditional · Forecast · MKE
5. Pattern clustering · near-miss detection
6. Make-It-Eligible levers · visual timeline

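
The sweep and near-miss steps above can be sketched as three nested loops that record which segments fail for each (title, category, term year) combination. The segment checks are trivial stand-ins, and treating "exactly one failed segment" as a near miss is an assumption made for illustration, not CLIP's stated definition. The mapping onto the four verdicts is not attempted here.

```typescript
type Segment = "Q" | "W" | "C" | "R";
type CatalogTitle = { id: string; releaseYear: number; territory: string };
type SegmentCheck = (title: CatalogTitle, category: string, termYear: number) => boolean;

// Stand-in checks; the real ones would come from evaluating the rule trees.
const checks: Record<Segment, SegmentCheck> = {
  Q: (t) => t.territory === "US",
  W: (t, _category, termYear) => termYear - t.releaseYear <= 2,
  C: () => true,
  R: () => true,
};

type SweepRow = { titleId: string; category: string; termYear: number; failed: Segment[] };

function sweep(titles: CatalogTitle[], categories: string[], termYears: number[]): SweepRow[] {
  const segments: Segment[] = ["Q", "W", "C", "R"];
  const rows: SweepRow[] = [];
  for (const t of titles) {
    for (const c of categories) {
      for (const y of termYears) {
        // Keep the failing segments so later steps can explain the outcome.
        const failed = segments.filter((s) => !checks[s](t, c, y));
        rows.push({ titleId: t.id, category: c, termYear: y, failed });
      }
    }
  }
  return rows;
}

const sweepRows = sweep(
  [
    { id: "T1", releaseYear: 2023, territory: "US" },
    { id: "T2", releaseYear: 2020, territory: "US" },
  ],
  ["A"],
  [2024, 2025]
);
const passing = sweepRows.filter((r) => r.failed.length === 0);
const nearMisses = sweepRows.filter((r) => r.failed.length === 1); // assumed near-miss definition
```

Recording the failed segments, not just a boolean, is what lets a later step propose levers: a row that fails only `W` points at the window as the thing to negotiate.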
| Component | Technology | Why This Choice |
|---|---|---|
| Deterministic Core | Node.js / TypeScript | The evaluator is O(titles × cats × terms × leaves) — pure CPU, no GPU. ~3M leaf evals in <1s in plain JS. TypeScript adds type safety on the three JSON contracts. |
| Agentic Layer | Python (LangGraph) + GPT-4o | Pattern analysis and lever generation need LLM reasoning; LangGraph manages the multi-step agent state. |
| Catalog Ingestion | AtlasMock REST API + IMDb/TMDb hooks | Internal master-data API for proprietary fields; external APIs for public metadata (box office, cast, screens). |
| RMS Integration | REST/GraphQL adapters (Rightsline, FilmTrack) | Two-way: read existing grants, write back new commitments. Adapter pattern so RMS is swappable. |
| Results Store | PostgreSQL (same cluster as Module 1) | Per-title, per-term eligibility rows; JSONB for the perTerm array; relational for aggregations. |
| Cache | Redis (ElastiCache) | Cache catalog snapshots; pub/sub for engine:results event to UI via WebSocket. |
| Visualization | Angular 19 + D3.js / Chart.js or Next.js 15 + D3.js / Recharts | Strip-Gantt (D3), bar charts (Chart.js / Recharts), term-year timeline. Same SPA shell as Module 1. Next.js offers SSR for SEO and faster initial paint. |
| Export | REST API + CSV/Excel/PDF generators | Per-deal, per-term, planner-ready exports. Salesforce custom object sync via Heroku Connect or API. |
| Resource | Specification | Est. Monthly |
|---|---|---|
| ECS Fargate (all tasks) | ~12 tasks avg, mix of on-demand + spot | $800–1,200 |
| RDS PostgreSQL | db.r6g.large, Multi-AZ, 100GB | $450 |
| ElastiCache Redis | cache.r6g.large, 1 replica | $280 |
| S3 | ~50GB PDFs + artifacts | $5 |
| SQS / SNS / EventBridge | ~100K messages/month | $10 |
| OpenAI API (GPT-4o + Ada) | ~500 contracts/month × ~$3/contract | $1,500 |
| AWS Textract | ~500 PDFs × ~100 pages | $750 |
| CloudWatch + X-Ray | Logs + traces + alarms | $100 |
| Secrets Manager + KMS | ~20 secrets | $15 |
| Total Estimated | | ~$3,900–4,300 |
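
As a quick arithmetic check, the line items in the table can be summed to confirm the total. The figures are copied from the rows above; the only derived number is the per-page Textract price implied by the table's own inputs.

```typescript
type LineItem = { resource: string; low: number; high: number };

// Monthly estimates copied from the cost table (USD).
const lineItems: LineItem[] = [
  { resource: "ECS Fargate", low: 800, high: 1200 },
  { resource: "RDS PostgreSQL", low: 450, high: 450 },
  { resource: "ElastiCache Redis", low: 280, high: 280 },
  { resource: "S3", low: 5, high: 5 },
  { resource: "SQS / SNS / EventBridge", low: 10, high: 10 },
  { resource: "OpenAI API", low: 1500, high: 1500 }, // 500 contracts x ~$3
  { resource: "AWS Textract", low: 750, high: 750 },
  { resource: "CloudWatch + X-Ray", low: 100, high: 100 },
  { resource: "Secrets Manager + KMS", low: 15, high: 15 },
];

const totalLow = lineItems.reduce((sum, item) => sum + item.low, 0); // 3910
const totalHigh = lineItems.reduce((sum, item) => sum + item.high, 0); // 4310

// Per-page price implied by the Textract row: $750 over 500 PDFs x 100 pages.
const impliedTextractPerPage = 750 / (500 * 100); // 0.015
```

The sums, $3,910 and $4,310, round to the ~$3,900–4,300 shown in the total row.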
All infra as code. Separate stacks for networking, data, compute, observability. Environment promotion: dev → QA → staging → prod.
PR → lint → unit test → build → integration test → deploy to dev. Merge to main → deploy to QA. Tag → prod. Blue/green deployments on ECS.
smoke.test.mjs runs the deterministic evaluator against known truth tables. Integration tests hit the full API stack. LLM extraction tests use golden PDFs with expected JSON diffs.
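A truth-table test of the kind described might look like the sketch below. It is not the contents of smoke.test.mjs; the runner, the cap check standing in for the evaluator, and the table rows are all hypothetical.

```typescript
type TruthRow<I> = { label: string; input: I; expected: boolean };

// Returns the labels of rows where the evaluator disagrees with the table.
function runTruthTable<I>(evaluate: (input: I) => boolean, table: TruthRow<I>[]): string[] {
  return table.filter((row) => evaluate(row.input) !== row.expected).map((row) => row.label);
}

// Stand-in for the real evaluator: a hypothetical cap check.
type CapInput = { alreadyLicensed: number; cap: number };
const underCap = (input: CapInput): boolean => input.alreadyLicensed < input.cap;

const capTruthTable: TruthRow<CapInput>[] = [
  { label: "well under cap", input: { alreadyLicensed: 2, cap: 10 }, expected: true },
  { label: "one slot left", input: { alreadyLicensed: 9, cap: 10 }, expected: true },
  { label: "cap reached", input: { alreadyLicensed: 10, cap: 10 }, expected: false },
];

const mismatches = runTruthTable(underCap, capTruthTable);
if (mismatches.length > 0) {
  throw new Error(`truth table mismatches: ${mismatches.join(", ")}`);
}
```

Returning the mismatching labels, instead of stopping at the first failure, makes boundary regressions (such as `<` becoming `<=`) easy to spot in one run.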
Dev — single-task, shared RDS. QA — full topology, synthetic data. Staging — prod parity, real contract samples. Prod — Multi-AZ, auto-scale, WAF.
| Contract | Owner | Shape | Production Endpoint |
|---|---|---|---|
| contract.json | Deal team / Module 1 | Header, terms, categories | GET /api/contracts/{id} |
| rules.json | Module 1 extraction pipeline | 4 segments × N categories | GET /api/contracts/{id}/rules |
| catalog.json | Master-data system (AtlasMock) | Title records × attributes | GET /api/catalog?territory={t} |
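
The three JSON contracts could be typed roughly as follows. Only the three-way split, the four segment names, and the endpoint paths come from this document; every field name inside the interfaces is a placeholder.

```typescript
// Field names are hypothetical; only the contract split and endpoint paths come from the table.
type SegmentKey = "qualifiers" | "startDateWindow" | "caps" | "rateCards";

interface ContractDoc {
  header: { contractId: string; licensee: string };
  terms: { termYear: number; start: string; end: string }[];
  categories: string[];
}

interface RulesDoc {
  contractId: string;
  // One entry per category, each carrying all four segments.
  categories: { [category: string]: Record<SegmentKey, unknown> };
}

interface CatalogDoc {
  territory: string;
  titles: { titleId: string; attributes: { [attribute: string]: string | number | boolean } }[];
}

const endpoints = {
  contract: (id: string): string => `/api/contracts/${encodeURIComponent(id)}`,
  rules: (id: string): string => `/api/contracts/${encodeURIComponent(id)}/rules`,
  catalog: (territory: string): string => `/api/catalog?territory=${encodeURIComponent(territory)}`,
};

const SEGMENT_KEYS: SegmentKey[] = ["qualifiers", "startDateWindow", "caps", "rateCards"];

// Runtime check that a parsed rules.json carries all four segments for every category.
function missingSegments(doc: RulesDoc): string[] {
  const missing: string[] = [];
  for (const category of Object.keys(doc.categories)) {
    for (const key of SEGMENT_KEYS) {
      if (!(key in doc.categories[category])) missing.push(`${category}.${key}`);
    }
  }
  return missing;
}
```

Static types only protect code inside one service. Since Engine 2 receives rules as JSON over the wire, a runtime check such as `missingSegments` is one way to catch a malformed payload at the boundary.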