
Technical Architecture & Resource Plan

An engineer's take on how to build CLIP™ — cloud resources, service boundaries, data flows and the technology stack for both engines.

1 · Overall Platform Architecture

CLIP™ is a three-tier, two-engine platform. The tiers map to business capability; the engines map to runtime workloads. Both engines share a common data layer and a unified API gateway.

Business context: Sales planners currently perform a 6-step manual workflow every quarter: re-reading contracts, matching titles, confirming rights (across title × territory × language × media type × license type × window), calculating fees, and exploring near-miss titles ad hoc. CLIP automates steps 1–4 completely and adds systematic intelligence to step 5 (Make-It-Eligible). The architecture below is designed to evaluate every combination of the six rights dimensions in real time.

[Diagram: overall platform architecture, by layer]

  • Presentation: SPA (Angular / Next.js; Module 1 + Module 2), API Gateway (REST · WebSocket), Auth (Okta; OIDC · RBAC · MFA)
  • Engine 1 · Contract Rule Extraction: PDF Ingestion (OCR · Layout · Chunk), Agentic RAG (Planner · Retriever · Reader), LLM Cluster (Q · W · C · R · Amend), Verifier (catalog-grounded), Review UI (5-step wizard), SQS / Job Queue (3 retries · DLQ), S3 PDF Storage (versioned · lifecycle), Vector Store (PGVector; RAG retrieval · embeddings), Audit Log (append-only · AI vs User)
  • Engine 2 · Catalog Aggregator & Insights: Aggregator (Rules × Catalog join), Pattern Analyzer (near-miss clusters), Eligibility Reasoner (What-if · Booster), Insight Narrator (Visual · Multi-year), Integrations (SF · ERP · RMS)
  • Shared Data Layer: PostgreSQL + JSONB (Contracts · Rules · Results), Redis (cache + pub/sub; engine events · sessions), S3 (docs + artifacts; PDFs · exports · snapshots), Catalog API (AtlasMock; Titles · Attributes · Live)
  • Infrastructure & Observability: ECS Fargate, CloudWatch + X-Ray, Terraform / CDK, GitHub Actions CI, Secrets Manager
  • Labelled artifact: rules.json

Architecture Principles

Principle 1

Engine Isolation

Module 1 and Module 2 are independently deployable services. They share a data layer but have zero runtime coupling — Engine 2 consumes Engine 1's output as JSON, not via function calls.

Principle 2

Declarative Rules, Tiny Evaluator

Rules are data, not code. A generic recursive evaluator handles every contract. New attributes require zero code changes — only master-data additions.
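A minimal Python sketch of the "rules are data" idea. It assumes a hypothetical rule-tree shape with all / any / not nodes and attribute-comparison leaves; the node names, operators and title fields are illustrative and not CLIP's actual schema. Section 3.3 proposes TypeScript for the production evaluator, so treat this only as a language-neutral illustration.

```python
from typing import Any

# Hypothetical leaf operators; in the design above these would come from master data.
OPS = {
    "eq": lambda actual, expected: actual == expected,
    "in": lambda actual, expected: actual in expected,
    "gte": lambda actual, expected: actual is not None and actual >= expected,
    "lte": lambda actual, expected: actual is not None and actual <= expected,
}


def evaluate(rule: dict[str, Any], title: dict[str, Any]) -> bool:
    """Recursively evaluate a rule tree against one title record."""
    if "all" in rule:
        return all(evaluate(child, title) for child in rule["all"])
    if "any" in rule:
        return any(evaluate(child, title) for child in rule["any"])
    if "not" in rule:
        return not evaluate(rule["not"], title)
    # Leaf node: {"attr": ..., "op": ..., "value": ...}
    actual = title.get(rule["attr"])
    return OPS[rule["op"]](actual, rule["value"])


# Illustrative qualifier tree (assumed attribute names).
qualifier = {
    "all": [
        {"attr": "genre", "op": "in", "value": ["Action", "Drama"]},
        {"any": [
            {"attr": "box_office_usd", "op": "gte", "value": 50_000_000},
            {"attr": "screens", "op": "gte", "value": 2000},
        ]},
        {"not": {"attr": "rating", "op": "eq", "value": "NC-17"}},
    ]
}

title_a = {"genre": "Action", "box_office_usd": 72_000_000, "screens": 1500, "rating": "PG-13"}
title_b = {"genre": "Comedy", "box_office_usd": 90_000_000, "screens": 3000, "rating": "PG"}

result_a = evaluate(qualifier, title_a)  # passes genre, box office, and rating checks
result_b = evaluate(qualifier, title_b)  # fails the genre leaf
```

In this sketch a new attribute such as a runtime threshold would appear only as another leaf in the rule data; the evaluate function itself would stay unchanged.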

Principle 3

Event-Driven Communication

Engine 1 emits contract:rules-confirmed. Engine 2 subscribes and re-runs eligibility. No polling. No tight coupling. New consumers subscribe without engine changes.
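The subscribe-and-react pattern can be sketched as below. This assumes an in-process bus as a stand-in for whatever transport is actually used (the document names Redis pub/sub and SNS/SQS elsewhere); the EventBus class, the payload field contract_id and both handlers are hypothetical.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-process stand-in for the real message transport (hypothetical helper)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> int:
        """Deliver the payload to every subscriber; return how many were notified."""
        handlers = list(self._handlers.get(topic, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = EventBus()
eligibility_runs: list[str] = []
audit_entries: list[str] = []


def rerun_eligibility(event: dict) -> None:
    # Engine 2 side: a real handler would trigger an eligibility run here.
    eligibility_runs.append(event["contract_id"])


def write_audit(event: dict) -> None:
    # A later consumer added without touching either engine.
    audit_entries.append(f"rules confirmed for {event['contract_id']}")


bus.subscribe("contract:rules-confirmed", rerun_eligibility)
bus.subscribe("contract:rules-confirmed", write_audit)

# Engine 1 side: publish once, without knowing who listens.
delivered = bus.publish("contract:rules-confirmed", {"contract_id": "C-001"})
```

The publisher only knows the topic name, which is what lets the audit consumer be added without changing either engine.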

Principle 4

Catalog Is the Ground Truth

Engine 2 publishes the attribute vocabulary that Engine 1's LLM must use. The feedback loop prevents hallucinated field names.
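One way to picture this loop: the same vocabulary is used twice, once to constrain the extraction prompt and once to check what came back. The sketch below assumes a flat set of attribute names; the vocabulary contents, function names and prompt wording are all hypothetical.

```python
def vocabulary_prompt(vocabulary: set[str]) -> str:
    """Render the allowed attribute names as a prompt fragment (illustrative wording)."""
    return "Use only these attribute names: " + ", ".join(sorted(vocabulary))


def check_fields(rule_fields: list[str], vocabulary: set[str]) -> dict[str, list[str]]:
    """Split extracted field names into catalog-known and unknown (possibly hallucinated)."""
    return {
        "known": [f for f in rule_fields if f in vocabulary],
        "unknown": [f for f in rule_fields if f not in vocabulary],
    }


# Assumed vocabulary published by the catalog side.
catalog_vocabulary = {"genre", "release_year", "box_office_usd", "territory", "language"}

prompt_fragment = vocabulary_prompt(catalog_vocabulary)

# Field names as they might come back from an extraction pass.
extracted = ["genre", "box_office_usd", "imdb_buzz_score"]
report = check_fields(extracted, catalog_vocabulary)
```

Here imdb_buzz_score lands in the unknown list, so it could be rejected or routed to human review instead of silently becoming a rule.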

2 · Module 1 — Contract Rule Extraction Engine

Module 1 transforms an 80–200 page PDF into structured, executable rules across four segments: Qualifiers, Start Date Window, Caps, and Rate Cards.

2.1 Pipeline Flow

  1. Upload: PDF → S3 → SQS message
  2. Ingest: OCR · layout · chunk into semantic sections
  3. Embed: Chunks → vector store for RAG retrieval
  4. Agent Plan: Planner identifies term years, categories, segment boundaries
  5. Extract: 5 specialist LLMs run in parallel (Q · W · C · R · Amend)
  6. Verify: Grounded against catalog attribute vocabulary
  7. Review: 5-step UI wizard · planner confirms each segment
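The parallel Extract step can be sketched with asyncio. This is an illustration only: run_specialist is a stub that calls no real LLM API, and the segment keys are assumed names for the five specialists.

```python
import asyncio

# Assumed keys for the five specialists (Q · W · C · R · Amend).
SEGMENTS = ["qualifiers", "start_date_window", "caps", "rate_cards", "amendments"]


async def run_specialist(segment: str, chunks: list[str]) -> dict:
    """Stub for one specialist LLM call (hypothetical; no real API is invoked)."""
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"segment": segment, "chunks_read": len(chunks), "rules": []}


async def extract_all(chunks: list[str]) -> dict[str, dict]:
    """Run all specialists concurrently and index their results by segment."""
    results = await asyncio.gather(*(run_specialist(s, chunks) for s in SEGMENTS))
    return {r["segment"]: r for r in results}


extraction = asyncio.run(extract_all(["section 4.1 ...", "schedule B ..."]))
```

Because the calls are I/O-bound, total latency in this arrangement is roughly that of the slowest specialist, not the sum of all five.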

2.2 Detailed Architecture Diagram

[Diagram: Module 1 detailed architecture. Components in the order shown:]

  • Sales Planner
  • API Layer (POST /contracts/upload)
  • S3 · PDF Object Store
  • SQS Queue (3 retries · exp backoff)
  • Extraction Worker (ECS Fargate · GPU optional; pulls from SQS · auto-scales)
  • PDF Parser (Textract / PyMuPDF)
  • Chunker (semantic sections)
  • Embedder (Ada-002 / Cohere)
  • Vector Store (Pinecone / PGVector)
  • Agent Planner (LangGraph / AutoGen)
  • Specialist LLM Cluster (parallel execution): Qualifier LLM, Start Date LLM, Caps LLM, Rate Card LLM, Amendment Diff LLM
  • Verifier & Confidence Scorer (catalog-grounded · 0.0–1.0 per field)
  • PostgreSQL · JSONB (Contracts · Rules · Audit)
  • 5-Step Review UI (Angular / Next.js)

2.3 Technology Choices — Module 1

Component | Technology | Why This Choice
PDF Parser | AWS Textract + PyMuPDF fallback | Textract handles scanned contracts with tables/images; PyMuPDF for native-text PDFs. Both produce structured text + bounding boxes.
Embedding Model | OpenAI Ada-002 or Cohere Embed v3 | 1536-dim vectors (Ada-002; Cohere Embed v3 returns 1024-dim); best cost/quality for contract-length chunks (512–1024 tokens).
Vector Store | PGVector (PostgreSQL ext) or Pinecone | PGVector for simplicity (single DB); Pinecone for scale if >50 concurrent contracts.
Agent Framework | LangGraph (LangChain) or AutoGen | Multi-step planning with tool use; LangGraph gives deterministic state machines for the planner-retriever-reader loop.
LLM Provider | GPT-4o (primary), Claude 3.5 Sonnet (fallback) | GPT-4o for multi-modal (tables + images); Claude for long-context (200-page single-pass fallback).
Job Queue | AWS SQS with DLQ | Managed, serverless, 3-retry with exponential backoff; DLQ for failed extractions.
Compute | ECS Fargate (spot for extraction workers) | No GPU needed for embedding; GPU optional for self-hosted LLM fine-tunes.
Database | PostgreSQL 16 (RDS) with JSONB | JSONB stores the rule trees natively; relational for contracts, terms, categories, audit log.
Frontend | Angular 19 + Vite or Next.js 15 (React 19) | Angular is SPE standard; Next.js offers SSR, React Server Components, and a wider ecosystem. Both support lazy loading + tree-shaking. Component library reusable across Module 1 and Module 2 UIs.
Auth | Okta (OIDC + RBAC) | SPE standard SSO; role-based access for planner vs manager vs admin.
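The Job Queue row's retry behaviour can be sketched as follows. It assumes that "3 retries" means three delivery attempts in total (comparable to an SQS redrive policy with maxReceiveCount set to 3) and a 2-second backoff base; both are assumptions, and the dead-letter queue is modelled as a plain list. No AWS API is called.

```python
MAX_RECEIVES = 3     # assumption: three delivery attempts before dead-lettering
BASE_DELAY_S = 2.0   # assumption: base of the exponential backoff


def backoff_delay(attempt: int, base: float = BASE_DELAY_S) -> float:
    """Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1)."""
    return base * 2 ** (attempt - 1)


def process_with_retries(job, handler, dead_letters, sleep=lambda seconds: None):
    """Try the handler up to MAX_RECEIVES times; dead-letter the job if all fail."""
    delays = []
    for attempt in range(1, MAX_RECEIVES + 1):
        try:
            handler(job)
            return {"status": "done", "attempts": attempt, "delays": delays}
        except Exception as exc:
            if attempt == MAX_RECEIVES:
                dead_letters.append({"job": job, "error": str(exc)})
                return {"status": "dead-lettered", "attempts": attempt, "delays": delays}
            delay = backoff_delay(attempt)
            delays.append(delay)
            sleep(delay)  # injected so the sketch runs without real waiting


calls = {"n": 0}


def flaky_extract(job):
    # Fails twice, then succeeds (simulated transient error).
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient OCR failure")


def broken_extract(job):
    raise ValueError("unreadable PDF")


dlq: list[dict] = []
recovered = process_with_retries({"contract_id": "C-001"}, flaky_extract, dlq)
failed = process_with_retries({"contract_id": "C-002"}, broken_extract, dlq)
```

With these assumed values the waits are 2 s and then 4 s; a job that still fails on the third attempt ends up in the dead-letter list for inspection.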

3 · Module 2 — Catalog Aggregator & Insights Engine

Module 2 takes Engine 1's structured rules + the live title catalog and produces per-title, per-term-year eligibility verdicts, revenue projections, and Make-It-Eligible playbooks.

3.1 Engine 2 Pipeline

  1. Ingest: Rules JSON + Catalog JSON (or API)
  2. Join: Title × Category × Term-Year sweep
  3. Evaluate: Q · W · C · R per (title, category, term)
  4. Segment: Eligible · Conditional · Forecast · MKE
  5. Analyze: Pattern clustering · near-miss detection
  6. Narrate: Make-It-Eligible levers · visual timeline
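The Join, Evaluate and Segment steps can be illustrated with a small sweep. The segment semantics here are assumptions made for the example: "Forecast" means the title qualifies but its window is not yet open, "MKE" means the window is open but a qualifier fails, and "Ineligible" is a hypothetical catch-all. "Conditional" is omitted, and all field names are invented.

```python
from itertools import product

titles = [
    {"id": "T1", "genre": "Action", "release_year": 2023},
    {"id": "T2", "genre": "Drama", "release_year": 2025},
    {"id": "T3", "genre": "Comedy", "release_year": 2022},
]
# One illustrative category: allowed genres plus a minimum age before the window opens.
categories = [{"id": "CAT-A", "genres": {"Action", "Drama"}, "min_age_years": 1}]
term_years = [2025, 2026]


def segment(title: dict, category: dict, term_year: int) -> str:
    """Assign one (title, category, term-year) cell to a segment (assumed semantics)."""
    qualifies = title["genre"] in category["genres"]
    window_open = term_year - title["release_year"] >= category["min_age_years"]
    if qualifies and window_open:
        return "Eligible"
    if qualifies:
        return "Forecast"  # qualifies, but the window opens in a later term
    return "MKE" if window_open else "Ineligible"


# Title × Category × Term-Year sweep.
results = {
    (t["id"], c["id"], y): segment(t, c, y)
    for t, c, y in product(titles, categories, term_years)
}
```

In this toy data T2 moves from Forecast in 2025 to Eligible in 2026, which is the kind of per-term-year verdict the pipeline describes.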

3.2 Architecture Diagram

[Diagram: Module 2 architecture. Components as shown:]

  • Inputs: Engine 1 Output (rules.json · contract.json), Title Catalog (AtlasMock API · catalog.json), External Metadata (IMDb · TMDb · BO Mojo)
  • Deterministic Core: evalQualifier(), evalWindow(), evalCaps(), evalRate()
  • Agentic Layer: Pattern Analyzer, Eligibility Reasoner, MKE Lever Generator, Insight Narrator
  • Consumers: Dashboard UI, Salesforce, ERP / Finance, Tableau / Power BI
  • Rights Management: Rightsline · FilmTrack · ERP (RMS, bidirectional)
  • Event Bus (SNS/SQS): engine:results · catalog:changed
  • PostgreSQL · Results Store
  • Redis · Cache + Pub/Sub

3.3 Technology Choices — Module 2

Component | Technology | Why This Choice
Deterministic Core | Node.js / TypeScript | The evaluator is O(titles × cats × terms × leaves): pure CPU, no GPU. ~3M leaf evals in <1s in plain JS. TypeScript adds type safety on the three JSON contracts.
Agentic Layer | Python (LangGraph) + GPT-4o | Pattern analysis and lever generation need LLM reasoning; LangGraph manages the multi-step agent state.
Catalog Ingestion | AtlasMock REST API + IMDb/TMDb hooks | Internal master-data API for proprietary fields; external APIs for public metadata (box office, cast, screens).
RMS Integration | REST/GraphQL adapters (Rightsline, FilmTrack) | Two-way: read existing grants, write back new commitments. Adapter pattern so RMS is swappable.
Results Store | PostgreSQL (same cluster as Module 1) | Per-title, per-term eligibility rows; JSONB for the perTerm array; relational for aggregations.
Cache | Redis (ElastiCache) | Cache catalog snapshots; pub/sub for engine:results event to UI via WebSocket.
Visualization | Angular 19 + D3.js / Chart.js or Next.js 15 + D3.js / Recharts | Strip-Gantt (D3), bar charts (Chart.js / Recharts), term-year timeline. Same SPA shell as Module 1. Next.js offers SSR for SEO and faster initial paint.
Export | REST API + CSV/Excel/PDF generators | Per-deal, per-term, planner-ready exports. Salesforce custom object sync via Heroku Connect or API.

4 · Cloud Resource Plan

4.1 AWS Resource Map

Compute

ECS Fargate

  • API Service: 2 tasks, 1 vCPU / 2GB, ALB-fronted
  • Extraction Workers: 0–8 tasks (auto-scale on SQS depth), 2 vCPU / 4GB, Spot
  • Engine 2 Processor: 2 tasks, 2 vCPU / 4GB
  • Agentic Layer: 1–4 tasks, 2 vCPU / 8GB (LLM calls are I/O-bound, not CPU-bound)

Storage

S3 + RDS + ElastiCache

  • S3: PDF bucket (versioned, lifecycle to Glacier after 1yr)
  • RDS PostgreSQL 16: db.r6g.large, Multi-AZ, 100GB gp3
  • ElastiCache Redis: cache.r6g.large, 1 replica

Messaging

SQS + SNS + EventBridge

  • SQS: extraction-job-queue + DLQ (3 retries, 30s visibility timeout)
  • SNS: engine-results topic (fan-out to UI WebSocket, Salesforce sync, audit)
  • EventBridge: catalog:changed events from AtlasMock

AI/ML

External LLM APIs

  • OpenAI GPT-4o: primary extraction + multi-modal
  • Anthropic Claude 3.5 Sonnet: long-context fallback (200k tokens)
  • OpenAI Ada-002: embeddings for vector store
  • AWS Textract: PDF OCR + table extraction

Security

IAM + Secrets + WAF

  • Okta: OIDC SSO, RBAC (planner, manager, admin)
  • Secrets Manager: API keys, DB creds, LLM tokens
  • WAF: rate limiting on upload endpoint
  • KMS: encryption at rest for S3 + RDS

Observability

CloudWatch + X-Ray

  • CloudWatch Logs: structured JSON logs from all services
  • X-Ray: distributed tracing across API → SQS → Worker → LLM
  • CloudWatch Alarms: DLQ depth, extraction latency P95, API 5xx rate
  • Dashboards: extraction throughput, eligibility run time, LLM cost/call

4.2 Estimated Monthly Cost (Production)

Resource | Specification | Est. Monthly
ECS Fargate (all tasks) | ~12 tasks avg, mix of on-demand + spot | $800–1,200
RDS PostgreSQL | db.r6g.large, Multi-AZ, 100GB | $450
ElastiCache Redis | cache.r6g.large, 1 replica | $280
S3 | ~50GB PDFs + artifacts | $5
SQS / SNS / EventBridge | ~100K messages/month | $10
OpenAI API (GPT-4o + Ada) | ~500 contracts/month × ~$3/contract | $1,500
AWS Textract | ~500 PDFs × ~100 pages | $750
CloudWatch + X-Ray | Logs + traces + alarms | $100
Secrets Manager + KMS | ~20 secrets | $15
Total Estimated | | ~$3,900–4,300/mo
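The total can be reproduced from the line items. The short check below uses only the figures in the table; the per-page Textract rate is derived from those figures, not quoted from a price list.

```python
# (low, high) monthly estimates in USD, copied from the table above.
line_items = {
    "ECS Fargate": (800, 1200),
    "RDS PostgreSQL": (450, 450),
    "ElastiCache Redis": (280, 280),
    "S3": (5, 5),
    "SQS / SNS / EventBridge": (10, 10),
    "OpenAI API": (1500, 1500),
    "AWS Textract": (750, 750),
    "CloudWatch + X-Ray": (100, 100),
    "Secrets Manager + KMS": (15, 15),
}

total_low = sum(low for low, _ in line_items.values())
total_high = sum(high for _, high in line_items.values())

# Volume-driven items, using the table's stated volumes.
contracts_per_month = 500
openai_cost = contracts_per_month * 3            # ~$3 per contract
textract_pages = contracts_per_month * 100       # ~100 pages per PDF
implied_rate_per_page = 750 / textract_pages     # rate implied by the $750 line
```

The sums come to $3,910 and $4,310, which the table rounds to ~$3,900–4,300. The two API lines scale linearly with contract volume and together make up more than half of the total.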

5 · Infrastructure & DevOps

IaC

Terraform / AWS CDK

All infra as code. Separate stacks for networking, data, compute, observability. Environment promotion: dev → QA → staging → prod.

CI/CD

GitHub Actions

PR → lint → unit test → build → integration test → deploy to dev. Merge to main → deploy to QA. Tag → prod. Blue/green deployments on ECS.

Testing

Headless Smoke + Integration

smoke.test.mjs runs the deterministic evaluator against known truth tables. Integration tests hit the full API stack. LLM extraction tests use golden PDFs with expected JSON diffs.
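A truth-table smoke test might look like the sketch below. It is written in Python for illustration (the file named above is a JavaScript module), and eval_window is a hypothetical stand-in for the real window check, assumed here to be an inclusive date-range test.

```python
from datetime import date


def eval_window(start: date, window_open: date, window_close: date) -> bool:
    """Hypothetical window check: the start date must fall inside [open, close]."""
    return window_open <= start <= window_close


OPEN, CLOSE = date(2025, 1, 1), date(2025, 12, 31)

# Each row: (start date, window open, window close, expected verdict).
TRUTH_TABLE = [
    (date(2025, 1, 1), OPEN, CLOSE, True),     # on the opening boundary
    (date(2025, 12, 31), OPEN, CLOSE, True),   # on the closing boundary
    (date(2025, 6, 15), OPEN, CLOSE, True),    # mid-window
    (date(2024, 12, 31), OPEN, CLOSE, False),  # one day early
    (date(2026, 1, 1), OPEN, CLOSE, False),    # one day late
]

# Collect every row whose actual verdict differs from the expected one.
failures = [row for row in TRUTH_TABLE if eval_window(*row[:3]) != row[3]]
```

Because the evaluator is deterministic, an empty failures list on a fixed table is a cheap regression gate that needs no LLM or network access.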

Environments

4 Environments

Dev — single-task, shared RDS. QA — full topology, synthetic data. Staging — prod parity, real contract samples. Prod — Multi-AZ, auto-scale, WAF.

6 · End-to-End Data Flow

[Diagram: end-to-end data flow]

  • Contract Path (Engine 1): PDF Upload → S3 + SQS → OCR + Chunk → RAG + LLM → Verify → rules.json
  • Catalog Path: IMDb / TMDb + Internal Media → Aggregate → Normalise → catalog.json
  • Engine 2: Join · Score · Segment
  • Outputs: Dashboard, SF / ERP, BI / Export

The Three Data Contracts

Contract | Owner | Shape | Production Endpoint
contract.json | Deal team / Module 1 | Header, terms, categories | GET /api/contracts/{id}
rules.json | Module 1 extraction pipeline | 4 segments × N categories | GET /api/contracts/{id}/rules
catalog.json | Master-data system (AtlasMock) | Title records × attributes | GET /api/catalog?territory={t}
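As an illustration of the "4 segments × N categories" shape, the sketch below checks a rules.json payload for missing segments per category. The key names (categories, qualifiers, start_date_window, caps, rate_cards) and the sample payload are assumptions; the document does not specify the actual JSON schema.

```python
import json
from typing import TypedDict


class CategoryRules(TypedDict, total=False):
    """Assumed per-category shape: one entry per segment."""
    qualifiers: dict
    start_date_window: dict
    caps: dict
    rate_cards: dict


SEGMENT_KEYS = ("qualifiers", "start_date_window", "caps", "rate_cards")


def missing_segments(rules: dict) -> dict[str, list[str]]:
    """Return, per category, the segment keys absent from a rules payload."""
    report = {}
    for category_id, segments in rules["categories"].items():
        absent = [key for key in SEGMENT_KEYS if key not in segments]
        if absent:
            report[category_id] = absent
    return report


# Hypothetical payload, as it might be returned by the rules endpoint.
payload = json.loads("""
{
  "contract_id": "C-001",
  "categories": {
    "CAT-A": {"qualifiers": {}, "start_date_window": {}, "caps": {}, "rate_cards": {}},
    "CAT-B": {"qualifiers": {}, "start_date_window": {}}
  }
}
""")

gaps = missing_segments(payload)
```

A check like this at the Engine 1 / Engine 2 boundary would catch an incomplete extraction before eligibility is computed on it.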