CLIP™ — Technical Architecture & Resource Plan

1 · Overall Platform Architecture

CLIP™ is a three-tier, two-engine platform. The tiers map to business capability; the engines map to runtime workloads. Both engines share a common data layer and a unified API gateway.

Business context: Sales planners currently perform a five-step manual workflow every quarter: (1) re-reading contracts, (2) matching titles, (3) confirming rights (across title × territory × language × media type × license type × window), (4) calculating fees, and (5) ad hoc exploration of near-miss titles. CLIP fully automates steps 1–4 and adds systematic intelligence to step 5 (Make-It-Eligible). The architecture below is designed to evaluate every combination of those six dimensions in real time.

Architecture Principles

Principle 1

Engine Isolation

Module 1 (Engine 1) and Module 2 (Engine 2) are independently deployable services. They share a data layer but have no runtime coupling: Engine 2 consumes Engine 1's output as JSON, not via function calls.

Principle 2

Declarative Rules, Tiny Evaluator

Rules are data, not code. A generic recursive evaluator handles every contract. New attributes require zero code changes — only master-data additions.
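
A minimal sketch of what "rules are data" could look like, written in Python for brevity (Section 3.3 places the production evaluator in Node.js/TypeScript). The node shape (`all` / `any` / `not` combinators, `attr` / `op` / `value` leaves), the operator table, and the sample attribute names are illustrative assumptions, not the actual CLIP rule schema.

```python
from typing import Any, Callable, Dict

# Hypothetical operator table. Leaves reference attributes by name, so a new
# catalog attribute needs no code change; only a new operator would.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda actual, expected: actual == expected,
    "in": lambda actual, expected: actual in expected,
    "gte": lambda actual, expected: actual is not None and actual >= expected,
    "lte": lambda actual, expected: actual is not None and actual <= expected,
}


def evaluate(node: Dict[str, Any], title: Dict[str, Any]) -> bool:
    """Recursively evaluate a rule tree against one title record."""
    if "all" in node:
        return all(evaluate(child, title) for child in node["all"])
    if "any" in node:
        return any(evaluate(child, title) for child in node["any"])
    if "not" in node:
        return not evaluate(node["not"], title)
    # Leaf: compare one catalog attribute against the rule's value.
    return OPS[node["op"]](title.get(node["attr"]), node["value"])


rule = {
    "all": [
        {"attr": "territory", "op": "in", "value": ["US", "CA"]},
        {"any": [
            {"attr": "box_office_usd", "op": "gte", "value": 50_000_000},
            {"attr": "media_type", "op": "eq", "value": "SVOD"},
        ]},
    ]
}
title = {"territory": "US", "box_office_usd": 12_000_000, "media_type": "SVOD"}
print(evaluate(rule, title))  # True
```

Because the tree is plain JSON-compatible data, the same rule could be stored in a JSONB column and evaluated unchanged by any service that implements the same small operator table.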

Principle 3

Event-Driven Communication

Engine 1 emits contract:rules-confirmed. Engine 2 subscribes and re-runs eligibility. No polling. No tight coupling. New consumers subscribe without engine changes.
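
The publish/subscribe relationship can be sketched with an in-process bus. This is only a stand-in for a managed service; the `EventBus` class and the `contractId` payload field are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """In-process stand-in for a managed message bus."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: Dict[str, Any]) -> None:
        # The publisher does not know who is listening.
        for handler in self._handlers[event]:
            handler(payload)


bus = EventBus()
reruns: List[str] = []
audit_log: List[str] = []

# Engine 2 reacts to confirmed rules; it never calls into Engine 1.
bus.subscribe("contract:rules-confirmed", lambda e: reruns.append(e["contractId"]))
# A new consumer is added without touching either engine.
bus.subscribe("contract:rules-confirmed",
              lambda e: audit_log.append(f"confirmed {e['contractId']}"))

bus.publish("contract:rules-confirmed", {"contractId": "C-001"})
print(reruns, audit_log)
```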

Principle 4

Catalog Is the Ground Truth

Engine 2 publishes the attribute vocabulary that Engine 1's LLM must use. The feedback loop prevents hallucinated field names.
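
One way to enforce this grounding is to reject any extracted rule whose leaves name an attribute outside the published vocabulary. The sketch below assumes a rule tree with `all` / `any` / `not` combinators and `attr` leaves; that shape and the vocabulary entries are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Set


def leaf_attrs(node: Dict[str, Any]) -> Iterator[str]:
    """Yield every attribute name referenced by the leaves of a rule tree."""
    for key in ("all", "any"):
        if key in node:
            for child in node[key]:
                yield from leaf_attrs(child)
            return
    if "not" in node:
        yield from leaf_attrs(node["not"])
        return
    yield node["attr"]


def unknown_attrs(rule: Dict[str, Any], vocabulary: Set[str]) -> List[str]:
    """Return attribute names in the rule that the catalog does not publish."""
    return sorted(set(leaf_attrs(rule)) - vocabulary)


vocabulary = {"territory", "language", "media_type", "license_type", "box_office_usd"}
extracted = {"all": [
    {"attr": "territory", "op": "eq", "value": "US"},
    {"attr": "boxOfficeGross", "op": "gte", "value": 1_000_000},  # not in vocabulary
]}
print(unknown_attrs(extracted, vocabulary))  # ['boxOfficeGross']
```

A non-empty result could be fed back to the extraction step as a retry hint, or surfaced to the planner during review.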

2 · Module 1 — Contract Rule Extraction Engine

Module 1 transforms an 80–200 page PDF into structured, executable rules across four segments: Qualifiers, Start Date Window, Caps, and Rate Cards.

2.1 Pipeline Flow

1. Upload: PDF → S3 → SQS message
2. Ingest: OCR · layout · chunk into semantic sections
3. Embed: Chunks → vector store for RAG retrieval
4. Agent Plan: Planner identifies term years, categories, segment boundaries
5. Extract: 5 specialist LLMs run in parallel (Q · W · C · R · Amend)
6. Verify: Grounded against catalog attribute vocabulary
7. Review: 5-step UI wizard · planner confirms each segment
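
The Extract stage's parallel fan-out could be sketched with `asyncio.gather`. The specialist names and the `run_specialist` stub are assumptions; a real worker would replace the stub with a call to the LLM provider.

```python
import asyncio
from typing import Dict, List

# Hypothetical identifiers for the five specialists (Q, W, C, R, Amend).
SPECIALISTS: List[str] = ["qualifiers", "window", "caps", "rate_cards", "amendments"]


async def run_specialist(name: str, chunks: List[str]) -> Dict[str, object]:
    """Placeholder for one specialist LLM call."""
    await asyncio.sleep(0)  # stands in for network-bound LLM latency
    return {"segment": name, "chunks_seen": len(chunks), "rules": []}


async def extract_all(chunks: List[str]) -> Dict[str, Dict[str, object]]:
    # gather() runs the five coroutines concurrently and preserves input order.
    results = await asyncio.gather(*(run_specialist(n, chunks) for n in SPECIALISTS))
    return {str(r["segment"]): r for r in results}


segments = asyncio.run(extract_all(["section 1 text", "section 2 text"]))
print(sorted(segments))
```

Since the calls are I/O-bound, wall-clock time is close to that of the slowest specialist rather than the sum of all five.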

2.2 Detailed Architecture Diagram

2.3 Technology Choices — Module 1

Component	Technology	Why This Choice
PDF Parser	AWS Textract + PyMuPDF fallback	Textract handles scanned contracts with tables/images; PyMuPDF for native-text PDFs. Both produce structured text + bounding boxes.
Embedding Model	OpenAI Ada-002 or Cohere Embed v3	1536-dim (Ada-002) or 1024-dim (Embed v3) vectors; best cost/quality for contract-length chunks (512–1024 tokens).
Vector Store	PGVector (PostgreSQL ext) or Pinecone	PGVector for simplicity (single DB); Pinecone for scale if >50 concurrent contracts.
Agent Framework	LangGraph (LangChain) or AutoGen	Multi-step planning with tool use; LangGraph gives deterministic state machines for the planner-retriever-reader loop.
LLM Provider	GPT-4o (primary), Claude 3.5 Sonnet (fallback)	GPT-4o for multi-modal (tables + images); Claude for long-context (200-page single-pass fallback).
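Job Queue	AWS SQS with DLQ	Managed and serverless; a redrive policy moves a message to the DLQ after 3 failed receives. Exponential backoff is applied by the worker (by extending the message's visibility timeout), since SQS does not provide it natively.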
Compute	ECS Fargate (spot for extraction workers)	No GPU needed for embedding. Fargate does not offer GPUs, so any self-hosted LLM fine-tune would need EC2-backed capacity.
Database	PostgreSQL 16 (RDS) with JSONB	JSONB stores the rule trees natively; relational for contracts, terms, categories, audit log.
Frontend	Angular 19 + Vite or Next.js 15 (React 19)	Angular is SPE standard; Next.js offers SSR, React Server Components, and a wider ecosystem. Both support lazy loading + tree-shaking. Component library reusable across Module 1 and Module 2 UIs.
Auth	Okta (OIDC + RBAC)	SPE standard SSO; role-based access for planner vs manager vs admin.

3 · Module 2 — Catalog Aggregator & Insights Engine

Module 2 takes Engine 1's structured rules + the live title catalog and produces per-title, per-term-year eligibility verdicts, revenue projections, and Make-It-Eligible playbooks.

3.1 Engine 2 Pipeline

1. Ingest: Rules JSON + Catalog JSON (or API)
2. Join: Title × Category × Term-Year sweep
3. Evaluate: Q · W · C · R per (title, category, term)
4. Segment: Eligible · Conditional · Forecast · MKE
5. Analyze: Pattern clustering · near-miss detection
6. Narrate: Make-It-Eligible levers · visual timeline
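
The Join and Evaluate stages amount to a Cartesian sweep. The sketch below is a simplified illustration: it only separates eligible titles from near-misses, and omits the Conditional and Forecast segments because their definitions are not given here. The `near_miss_max_failures` threshold, the leaf functions, and the title fields are all hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Title = Dict[str, object]
Leaf = Callable[[Title, int], bool]  # (title, term_year) -> pass/fail


def sweep(
    titles: List[Title],
    categories: Dict[str, List[Leaf]],
    term_years: List[int],
    near_miss_max_failures: int = 1,  # hypothetical threshold
) -> Dict[Tuple[str, str, int], str]:
    """Evaluate every (title, category, term year) combination."""
    verdicts: Dict[Tuple[str, str, int], str] = {}
    for title, (cat, leaves), year in product(titles, categories.items(), term_years):
        failures = sum(1 for leaf in leaves if not leaf(title, year))
        if failures == 0:
            verdict = "eligible"
        elif failures <= near_miss_max_failures:
            verdict = "near_miss"  # candidate for a Make-It-Eligible lever
        else:
            verdict = "ineligible"
        verdicts[(str(title["id"]), cat, year)] = verdict
    return verdicts


titles = [
    {"id": "T1", "release_year": 2023, "territory": "US"},
    {"id": "T2", "release_year": 2019, "territory": "FR"},
]
categories = {
    "first_run": [
        lambda t, y: t["territory"] == "US",
        lambda t, y: y - t["release_year"] <= 2,
    ],
}
for key, verdict in sweep(titles, categories, [2024, 2026]).items():
    print(key, verdict)
```

In this toy data, T1 is eligible in 2024 but becomes a near-miss in 2026 because only the release-age leaf fails, which is the kind of single-lever gap the Analyze and Narrate stages would pick up.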

3.2 Architecture Diagram

3.3 Technology Choices — Module 2

Component	Technology	Why This Choice
Deterministic Core	Node.js / TypeScript	The evaluator is O(titles × cats × terms × leaves) — pure CPU, no GPU. ~3M leaf evals in <1s in plain JS. TypeScript adds type safety on the three JSON contracts.
Agentic Layer	Python (LangGraph) + GPT-4o	Pattern analysis and lever generation need LLM reasoning; LangGraph manages the multi-step agent state.
Catalog Ingestion	AtlasMock REST API + IMDb/TMDb hooks	Internal master-data API for proprietary fields; external APIs for public metadata (box office, cast, screens).
RMS Integration	REST/GraphQL adapters (Rightsline, FilmTrack)	Two-way: read existing grants, write back new commitments. Adapter pattern so RMS is swappable.
Results Store	PostgreSQL (same cluster as Module 1)	Per-title, per-term eligibility rows; JSONB for the perTerm array; relational for aggregations.
Cache	Redis (ElastiCache)	Cache catalog snapshots; pub/sub for engine:results event to UI via WebSocket.
Visualization	Angular 19 + D3.js / Chart.js or Next.js 15 + D3.js / Recharts	Strip-Gantt (D3), bar charts (Chart.js / Recharts), term-year timeline. Same SPA shell as Module 1. Next.js offers SSR for SEO and faster initial paint.
Export	REST API + CSV/Excel/PDF generators	Per-deal, per-term, planner-ready exports. Salesforce custom object sync via Heroku Connect or API.

4 · Cloud Resource Plan

4.1 AWS Resource Map

Compute

ECS Fargate

API Service — 2 tasks, 1 vCPU / 2GB, ALB fronted
Extraction Workers — 0–8 tasks (auto-scale on SQS depth), 2 vCPU / 4GB, Spot
Engine 2 Processor — 2 tasks, 2 vCPU / 4GB
Agentic Layer — 1–4 tasks, 2 vCPU / 8GB (LLM calls are I/O-bound, not CPU)
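
The "auto-scale on SQS depth" rule for extraction workers reduces to a clamped backlog-per-task calculation. The 0–8 bounds come from the list above; the `msgs_per_worker` target of 5 is an assumed value, and in practice the rule would live in an ECS Service Auto Scaling policy rather than application code.

```python
import math


def desired_workers(queue_depth: int, msgs_per_worker: int = 5,
                    min_tasks: int = 0, max_tasks: int = 8) -> int:
    """Number of extraction tasks to run for a given queue backlog."""
    if queue_depth <= 0:
        return min_tasks  # scale to zero when the queue is empty
    wanted = math.ceil(queue_depth / msgs_per_worker)
    return max(min_tasks, min(max_tasks, wanted))


for depth in (0, 3, 12, 100):
    print(depth, desired_workers(depth))
# 0 0
# 3 1
# 12 3
# 100 8
```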

Storage

S3 + RDS + ElastiCache

S3 — PDF bucket (versioned, lifecycle to Glacier after 1yr)
RDS PostgreSQL 16 — db.r6g.large, Multi-AZ, 100GB gp3
ElastiCache Redis — cache.r6g.large, 1 replica

Messaging

SQS + SNS + EventBridge

SQS — extraction-job-queue + DLQ (3 retries, 30s visibility)
SNS — engine-results topic (fan-out to UI WebSocket, Salesforce sync, audit)
EventBridge — catalog:changed events from AtlasMock
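
SQS redelivers a message once its visibility timeout expires and moves it to the DLQ after the configured number of receives, but it does not apply exponential backoff itself. A worker can approximate backoff by setting a longer visibility timeout after each failed attempt (via the `ChangeMessageVisibility` API). The sketch below only computes the timeout; the base of 30 s matches the queue setting above, while the doubling factor and optional jitter are assumptions. A 30 s timeout is also shorter than a long extraction is likely to take, so a worker would need to extend it while processing to avoid duplicate delivery.

```python
import random

SQS_MAX_VISIBILITY_S = 43_200  # SQS caps the visibility timeout at 12 hours


def next_visibility_timeout(receive_count: int, base_s: int = 30,
                            factor: float = 2.0, jitter: float = 0.0) -> int:
    """Visibility timeout (seconds) to set after the Nth failed receive (1-based)."""
    delay = base_s * factor ** (receive_count - 1)
    # Optional +/- jitter fraction spreads out retries from many workers.
    delay *= 1 + random.uniform(-jitter, jitter)
    return int(min(delay, SQS_MAX_VISIBILITY_S))


MAX_RECEIVES = 3  # after the third failed receive the message goes to the DLQ
for attempt in range(1, MAX_RECEIVES + 1):
    print(attempt, next_visibility_timeout(attempt))
# 1 30
# 2 60
# 3 120
```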

AI/ML

External LLM APIs

OpenAI GPT-4o — primary extraction + multi-modal
Anthropic Claude 3.5 — long-context fallback (200k tokens)
OpenAI Ada-002 — embeddings for vector store
AWS Textract — PDF OCR + table extraction

Security

IAM + Secrets + WAF

Okta — OIDC SSO, RBAC (planner, manager, admin)
Secrets Manager — API keys, DB creds, LLM tokens
WAF — rate limiting on upload endpoint
KMS — encryption at rest for S3 + RDS

Observability

CloudWatch + X-Ray

CloudWatch Logs — structured JSON logs from all services
X-Ray — distributed tracing across API → SQS → Worker → LLM
CloudWatch Alarms — DLQ depth, extraction latency P95, API 5xx rate
Dashboards — extraction throughput, eligibility run time, LLM cost/call

4.2 Estimated Monthly Cost (Production)

Resource	Specification	Est. Monthly
ECS Fargate (all tasks)	~12 tasks avg, mix of on-demand + spot	$800–1,200
RDS PostgreSQL	db.r6g.large, Multi-AZ, 100GB	$450
ElastiCache Redis	cache.r6g.large, 1 replica	$280
S3	~50GB PDFs + artifacts	$5
SQS / SNS / EventBridge	~100K messages/month	$10
OpenAI API (GPT-4o + Ada)	~500 contracts/month × ~$3/contract	$1,500
AWS Textract	~500 PDFs × ~100 pages	$750
CloudWatch + X-Ray	Logs + traces + alarms	$100
Secrets Manager + KMS	~20 secrets	$15
Total Estimated		~$3,900–4,300/mo
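
As a quick arithmetic check, the line items in the table can be summed directly. The figures are copied from the table; the short dictionary keys are only labels.

```python
# (low, high) monthly estimates in USD, taken from the table above.
line_items = {
    "ECS Fargate": (800, 1200),
    "RDS PostgreSQL": (450, 450),
    "ElastiCache Redis": (280, 280),
    "S3": (5, 5),
    "SQS / SNS / EventBridge": (10, 10),
    "OpenAI API": (1500, 1500),   # ~500 contracts x ~$3
    "AWS Textract": (750, 750),   # ~500 PDFs x ~100 pages
    "CloudWatch + X-Ray": (100, 100),
    "Secrets Manager + KMS": (15, 15),
}

low = sum(lo for lo, _ in line_items.values())
high = sum(hi for _, hi in line_items.values())
print(low, high)  # 3910 4310

# Implied Textract unit price: $750 over 50,000 pages.
textract_per_page = 750 / (500 * 100)
print(textract_per_page)  # 0.015
```

The totals of $3,910 and $4,310 agree with the rounded range in the table. The estimate has no line for the Claude fallback or for Pinecone, so those would be additional if used.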

5 · Infrastructure & DevOps

IaC

Terraform / AWS CDK

All infra as code. Separate stacks for networking, data, compute, observability. Environment promotion: dev → QA → staging → prod.

CI/CD

GitHub Actions

PR → lint → unit test → build → integration test → deploy to dev. Merge to main → deploy to QA. Tag → prod. Blue/green deployments on ECS.

Testing

Headless Smoke + Integration

smoke.test.mjs runs the deterministic evaluator against known truth tables. Integration tests hit the full API stack. LLM extraction tests use golden PDFs with expected JSON diffs.

Environments

4 Environments

Dev — single-task, shared RDS. QA — full topology, synthetic data. Staging — prod parity, real contract samples. Prod — Multi-AZ, auto-scale, WAF.

6 · End-to-End Data Flow

The Three Data Contracts

Contract	Owner	Shape	Production Endpoint
`contract.json`	Deal team / Module 1	Header, terms, categories	`GET /api/contracts/{id}`
`rules.json`	Module 1 extraction pipeline	4 segments × N categories	`GET /api/contracts/{id}/rules`
`catalog.json`	Master-data system (AtlasMock)	Title records × attributes	`GET /api/catalog?territory={t}`
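
A consumer of these contracts might build the endpoint URLs and check that each category in `rules.json` carries all four segments. The sketch below is hedged: the endpoint paths come from the table, but the host, the segment key names, and the `categories` layout of `rules.json` are assumptions made for illustration.

```python
from typing import Dict, List
from urllib.parse import quote, urlencode

BASE = "https://clip.example.com"  # hypothetical host
SEGMENTS = ("qualifiers", "window", "caps", "rate_cards")  # assumed key names


def contract_url(contract_id: str) -> str:
    return f"{BASE}/api/contracts/{quote(contract_id, safe='')}"


def rules_url(contract_id: str) -> str:
    return contract_url(contract_id) + "/rules"


def catalog_url(territory: str) -> str:
    return f"{BASE}/api/catalog?{urlencode({'territory': territory})}"


def missing_segments(rules: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each category to the segments it lacks (empty dict means complete)."""
    gaps: Dict[str, List[str]] = {}
    for category, segments in rules["categories"].items():
        missing = [s for s in SEGMENTS if s not in segments]
        if missing:
            gaps[category] = missing
    return gaps


rules = {"categories": {
    "first_run": {"qualifiers": {}, "window": {}, "caps": {}, "rate_cards": {}},
    "library": {"qualifiers": {}, "window": {}},
}}
print(rules_url("C-001"))
print(catalog_url("US"))
print(missing_segments(rules))  # {'library': ['caps', 'rate_cards']}
```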