Sprint 3: SEC EDGAR Ingestion
Goal
Ingest real SEC filings by ticker, form type, year, or accession number, then answer filing questions with section-aware citations.
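The selection modes named in the goal could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Filing` record and `select_filing` function are hypothetical names, and it assumes "year" means the calendar year of the filing date (it could equally mean fiscal year in the real service).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Filing:
    """Hypothetical record for one filing in a company's filing index."""
    accession_number: str
    form_type: str      # e.g. "10-K"
    filing_date: str    # ISO date, "YYYY-MM-DD"


def select_filing(
    filings: List[Filing],
    form_type: str,
    filing_year: Optional[int] = None,
    accession_number: Optional[str] = None,
) -> Optional[Filing]:
    """Pick one filing: by accession number, else by year, else the latest."""
    if accession_number is not None:
        # An accession number identifies a single filing, so it wins outright.
        return next(
            (f for f in filings if f.accession_number == accession_number), None
        )
    candidates = [f for f in filings if f.form_type == form_type]
    if filing_year is not None:
        # Assumption: filter on the year of the filing date.
        candidates = [
            f for f in candidates if f.filing_date.startswith(f"{filing_year}-")
        ]
    # ISO dates sort lexicographically, so max() returns the most recent one.
    return max(candidates, key=lambda f: f.filing_date, default=None)
```

Giving the accession number priority keeps the lookup unambiguous when a caller supplies several filters at once; the real service may resolve that conflict differently.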
Why This Sprint Matters
Financial intelligence requires real source documents. SEC EDGAR filings are authoritative, but they are noisy, long, and require careful parsing, throttling, and citation handling.
What Was Built
- SEC ticker / CIK lookup
- Latest, year-based, and accession-based filing selection
- SEC request retry and throttling
- Filing HTML cleaning and section parser
- Section-aware chunk citations
- Frontend controls for live SEC ingestion
- sec-smoke evaluation suite
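To illustrate the retry and throttling item in the list above, here is a minimal sketch using only the standard library. The `Throttle` class and `fetch_with_retry` function are hypothetical names, not the contents of `sec_edgar_client.py`, and the choice of exponential backoff and of which exceptions count as transient are assumptions.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


class Throttle:
    """Client-side rate limit: keep at least 1/max_per_second between calls."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Block until the minimum interval since the previous call has passed."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self._min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_with_retry(
    fetch: Callable[[], T],
    throttle: Throttle,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() under the throttle, retrying transient errors with backoff."""
    for attempt in range(max_attempts):
        throttle.wait()
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap each HTTP request in `fetch_with_retry(lambda: ..., Throttle(max_per_second=10))`. The value 10 is an assumption based on SEC's published fair-access guidance, which also asks clients to send a descriptive `User-Agent` header; check the current guidance before relying on either.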
Architecture / Workflow
```mermaid
flowchart TD
    Request[Ingest SEC Request] --> Select[Filing Selection]
    Select --> Download[SEC Download]
    Download --> Clean[HTML Clean / Mojibake Repair]
    Clean --> Sections[Section Parser]
    Sections --> Chunks[Chunk With Section Metadata]
    Chunks --> Embeddings[Provider Embeddings]
    Embeddings --> Qdrant[(Qdrant)]
```
Key Files And APIs
- backend/app/services/sec_edgar_client.py
- backend/app/services/sec_parser.py
- POST /api/ingest/sec
- POST /api/evals/run
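The section parsing and chunking steps from the workflow might look something like the sketch below. It is not the contents of `sec_parser.py`: the regex, the `Chunk` fields, and the fixed-size character chunking are all assumptions chosen to keep the example short. It also assumes the HTML has already been cleaned to plain text with one heading per line.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Matches heading lines such as "Item 1A. Risk Factors" in cleaned filing text.
ITEM_HEADING = re.compile(
    r"^[ \t]*item[ \t]+(\d{1,2}[a-c]?)[.:]?[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass(frozen=True)
class Chunk:
    section: str   # normalized item label, e.g. "Item 1A"
    title: str     # heading text after the item number
    index: int     # position of the chunk within its section
    text: str


def split_sections(text: str) -> List[Tuple[str, str, str]]:
    """Return (section, title, body) for each item heading found in the text."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        label = f"Item {match.group(1).upper()}"
        sections.append((label, match.group(2).strip(), body))
    return sections


def chunk_sections(text: str, max_chars: int = 1000) -> List[Chunk]:
    """Cut each section into fixed-size chunks that remember their section."""
    chunks = []
    for section, title, body in split_sections(text):
        for index, start in enumerate(range(0, len(body), max_chars)):
            chunks.append(Chunk(section, title, index, body[start:start + max_chars]))
    return chunks
```

Real filings repeat item headings in the table of contents and sometimes start a line with a cross-reference such as "Item 1A", so a production parser would need extra rules to avoid false splits. Text before the first heading is dropped in this sketch.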
Validation Commands
```powershell
Invoke-RestMethod -Method Post http://localhost:8000/api/ingest/sec `
    -ContentType "application/json" `
    -Body '{"source":"edgar","ticker":"AAPL","form_type":"10-K","filing_year":2025}'
```
Demo Talking Points
Show that the platform can move beyond sample text and retrieve real SEC risk-factor evidence with accession-aware citations.
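For the demo, it may help to show what an accession-aware citation could carry. The sketch below is illustrative only: the `Citation` class, its fields, and the label format are hypothetical, and the accession number in the usage note is a made-up placeholder, not a real filing. The only external fact it relies on is that EDGAR accession numbers use a 10-2-6 digit layout.

```python
import re
from dataclasses import dataclass

# EDGAR accession numbers use a 10-2-6 digit layout: submitter ID, year, sequence.
ACCESSION_PATTERN = re.compile(r"^\d{10}-\d{2}-\d{6}$")


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation attached to a retrieved chunk."""
    ticker: str
    form_type: str
    accession_number: str
    section: str
    chunk_index: int

    def __post_init__(self) -> None:
        # Reject malformed accession numbers early so citations stay traceable.
        if not ACCESSION_PATTERN.match(self.accession_number):
            raise ValueError(
                f"malformed accession number: {self.accession_number!r}"
            )

    def label(self) -> str:
        """Human-readable citation string for display next to an answer."""
        return (
            f"{self.ticker} {self.form_type}, {self.section}, "
            f"chunk {self.chunk_index} (accession {self.accession_number})"
        )
```

With placeholder values, `Citation("AAPL", "10-K", "0000000000-25-000001", "Item 1A", 3).label()` yields a string that names the filing, the section, and the chunk, which is enough for a reader to trace an answer back to its source.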
What Changed From Previous Sprint
Sprint 2 made retrieval persistent. Sprint 3 improves the source quality by connecting to real SEC filings.