NewSpace.fyi

An open-data ingestion pipeline for the commercial spaceflight industry. Every four hours a serverless cron job polls 79 active RSS feeds, processes each article through a domain-tuned LLM, and writes both article rows and a normalised entity graph to a public Supabase database. There are no persistent servers: everything runs on free tiers.

Architecture

Zero persistent servers
GitHub Actions (Python · every 4h · feedparser · httpx)
gpt-4o-mini (JSON mode · T=0.2 · summary · entities · tags)
Supabase (PostgreSQL · Storage · articles · entity graph)
Next.js on Vercel (ISR · 5 min)
Data flow: scrape, upsert, read, briefing (LLM re-read)
79 RSS sources → ~200 articles / day → 30-day rolling window

Every component runs on a free-tier managed service. The whole system costs $0/month to operate.

Ingestion pipeline

01

Source ingestion

A GitHub Actions cron polls 79 active RSS endpoints every 4 hours via feedparser + httpx. Sources span specialist trade press (SpaceNews, Payload, Ars), corporate IR (Rocket Lab, Iridium, SES, AST SpaceMobile…), agencies (NASA program offices, ESA, JAXA, UKSA, NRO, FAA, FCC), defense and policy (DARPA, Breaking Defense, Space Policy Online), analyst and podcast feeds (Quilty, MECO, Off-Nominal), and PR Newswire wire copy. Per-source rate limits and URL-fingerprint dedup prevent duplicate processing across syndicators.
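A minimal sketch of URL-fingerprint dedup follows. It is not the project's actual code: the normalisation rules (dropping the scheme, a leading "www.", trailing slashes, and tracking parameters) and the names url_fingerprint and dedupe are assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as tracking noise (an assumed list, not the project's).
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def url_fingerprint(url: str) -> str:
    """Return a SHA-256 hex digest of a normalised form of the URL."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    # The scheme and fragment are left empty so http/https variants collapse.
    normalised = urlunsplit(("", host, path, urlencode(query), ""))
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def dedupe(urls):
    """Keep the first URL seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for url in urls:
        fingerprint = url_fingerprint(url)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(url)
    return unique
```

With these rules, a syndicated copy that differs only in tracking parameters or a trailing slash maps to the same fingerprint as the original link.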

02

Article extraction

For each entry the pipeline scrapes the article page for og:image and full body text. BeautifulSoup strips boilerplate, the cleaned text is truncated to 4,000 characters, and a search context string is built for downstream LLM disambiguation. Mixed-domain feeds (general defense, regulator dockets, podcasts) are gated by an LLM relevance check before summarisation.
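The extraction step can be sketched roughly as below. To keep the example dependency-free it uses the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the technique rather than the pipeline's code. The list of skipped tags and the names ArticleParser and extract_article are assumptions.

```python
from html.parser import HTMLParser

MAX_CHARS = 4000  # truncation limit described above
# Tags whose text is treated as boilerplate (an assumed list).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ArticleParser(HTMLParser):
    """Collects the og:image URL and visible text outside skipped tags."""

    def __init__(self):
        super().__init__()
        self.og_image = None
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:image":
            if self.og_image is None:
                self.og_image = attributes.get("content")
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(html: str):
    """Return (og_image_url_or_None, body_text_truncated_to_MAX_CHARS)."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return parser.og_image, text[:MAX_CHARS]
```

Truncating after cleaning, rather than before, means the 4,000-character budget is spent on article text and not on markup or navigation.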

03

LLM analysis (gpt-4o-mini)

Each article is sent to gpt-4o-mini with a single structured-JSON prompt (temperature 0.2). The model returns: a 60-word wire-service summary with mandatory bold-entity markers, exactly one of 12 segment tags, 2–5 normalised entity tags, an image search query, per-term keyword queries for in-summary disambiguation, and structured extraction of entities (companies, agencies, programs, vehicles), directed relations, timestamped events (funding/launch/contract/partnership), and people. Off-target summaries trigger one automatic retry.
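One way to picture the "off-target, retry once" rule is a validator over the returned JSON, sketched below. The field names (summary, segment, entity_tags), the use of ** as the bold marker, and the reading of "60-word" as an upper bound are all assumptions. The 12 segment tags are not listed here, so the allowed set is passed in by the caller. call_model is a hypothetical callable standing in for the real API request.

```python
import json
import re


def validate_analysis(raw: str, allowed_segments: set, max_words: int = 60) -> list:
    """Return a list of problems; an empty list means the response is on target."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    problems = []
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("missing summary")
    else:
        if len(summary.split()) > max_words:
            problems.append("summary too long")
        # Assumed convention: bold entities are wrapped in ** markers.
        if not re.search(r"\*\*[^*]+\*\*", summary):
            problems.append("summary has no bold entity")

    segment = data.get("segment")
    if not isinstance(segment, str) or segment not in allowed_segments:
        problems.append("segment is not one of the allowed tags")

    tags = data.get("entity_tags")
    if not isinstance(tags, list) or not 2 <= len(tags) <= 5:
        problems.append("expected 2 to 5 entity tags")
    return problems


def analyse_with_retry(call_model, allowed_segments: set, retries: int = 1):
    """Call the model, retrying up to `retries` times if validation fails."""
    for _ in range(retries + 1):
        raw = call_model()
        if not validate_analysis(raw, allowed_segments):
            return json.loads(raw)
    return None  # still off target after the retry budget
```

Returning None after the retry budget leaves the decision to skip or flag the article with the caller.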

04

Image resolution

Four-tier waterfall: (1) RSS media/enclosure URL, (2) og:image / twitter:image scraped from the page, (3) Brave Image Search API using the LLM-generated image_query, with domain-ranked filtering that blocks stock-photo hosts and prefers NASA/ESA/manufacturer imagery, (4) per-source fallback in Supabase Storage.
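The waterfall can be sketched as a list of lazily evaluated tiers, where a later tier is called only if every earlier one returned nothing. The function names and the exact-hostname matching below are assumptions for illustration; the real filter may rank domains differently.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit


def resolve_image(tiers: Iterable[Callable[[], Optional[str]]]) -> Optional[str]:
    """Return the first non-empty URL, calling tiers in order and stopping early."""
    for tier in tiers:
        url = tier()
        if url:
            return url
    return None


def pick_search_result(urls, blocked_hosts, preferred_hosts) -> Optional[str]:
    """Drop blocked hosts, then put preferred hosts ahead of everything else."""
    allowed = [u for u in urls if urlsplit(u).hostname not in blocked_hosts]
    # sorted() is stable, so search order is kept within each group;
    # False sorts before True, which places preferred hosts first.
    ranked = sorted(allowed, key=lambda u: urlsplit(u).hostname not in preferred_hosts)
    return ranked[0] if ranked else None
```

Passing callables rather than precomputed values means the paid or rate-limited search tier is never invoked when the feed or the page already supplied an image.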

05

Entity-graph upsert

Extracted entities are resolved through a tag_aliases map so variant spellings ("SpaceX", "Space Exploration Technologies") collapse onto canonical slugs. Four PostgreSQL RPCs run per article — touch_entity, touch_relation, touch_event, append_entity_person — to upsert into a normalised entity graph (entities + entity_relations + entity_events). Investor → portfolio edges are merged with curated seed data so each entity page surfaces relations, recent events, and people in one query.
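Alias resolution of this kind can be illustrated with a small lookup, sketched below. The slug format, the single alias entry, and the helper names slugify and canonical_slug are assumptions; the real tag_aliases map lives in the database and is not reproduced here.

```python
import re

# Illustrative alias map: variant slug -> canonical slug (assumed contents).
TAG_ALIASES = {
    "space-exploration-technologies": "spacex",
}


def slugify(name: str) -> str:
    """Lowercase the name and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def canonical_slug(name: str, aliases: dict = TAG_ALIASES) -> str:
    """Map a raw entity name to its canonical slug, falling back to its own slug."""
    slug = slugify(name)
    return aliases.get(slug, slug)
```

Both "SpaceX" and "Space Exploration Technologies" resolve to the same slug here, which is what lets the per-article upserts accumulate on one node.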

06

Storage, pruning, daily briefing

Clean rows are upserted to Supabase PostgreSQL via supabase-py, deduped on original_url, with a GIN-indexed entity_tags column for fast tag filtering. Articles older than 30 days are pruned each run to stay within Supabase free-tier limits. On the read side, a second LLM pass produces a 60-word and a 280-word daily briefing per UTC day at request time, with bold entities re-linked to per-term Google search queries.
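The upsert-then-prune behaviour can be modelled in memory as follows. This is a sketch of the logic only, not the supabase-py calls: the table is a plain dict keyed on original_url, and the published_at column name is an assumption.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # rolling window described above


def prune_cutoff(now=None) -> datetime:
    """Return the UTC timestamp before which articles are considered stale."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=RETENTION_DAYS)


def upsert_and_prune(table: dict, rows, now=None) -> int:
    """Upsert rows keyed on original_url, then delete stale rows.

    Returns the number of rows pruned.
    """
    for row in rows:
        # A later row with the same original_url overwrites the earlier one.
        table[row["original_url"]] = row
    cutoff = prune_cutoff(now)
    stale = [url for url, row in table.items() if row["published_at"] < cutoff]
    for url in stale:
        del table[url]
    return len(stale)
```

Keying on original_url is what makes a re-run of the cron idempotent: the same article fetched twice updates one row instead of creating a second.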

Entity graph

Beyond per-article rows, the pipeline maintains a continuously updated NewSpace knowledge graph: entities (companies, agencies, programs, vehicles, investors, people), directed relations (operates / built-by / invested-in / part-of / launched-on / regulates …), and timestamped events (fundings, launches, contracts, partnerships). Tag-alias resolution keeps variant spellings collapsed onto canonical slugs so every mention strengthens the same node.

entities: companies, agencies, programs, vehicles, investors, people
relations: 16 typed edges between entities, e.g. built-by, invested-in, part-of …
events: funding · launch · contract · partnership · regulatory
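To make the three record types concrete, here is a small in-memory model of such a graph. It is an illustration under assumptions, not the database schema: the class EntityGraph and its methods are hypothetical, mention counts stand in for "strengthening" a node, and event dates are assumed to be ISO 8601 strings.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EntityGraph:
    """Toy graph: mention counts per entity and per directed, typed edge."""

    entities: Counter = field(default_factory=Counter)
    relations: Counter = field(default_factory=Counter)
    events: list = field(default_factory=list)

    def touch_entity(self, slug: str) -> None:
        # Each mention increments the count on the same canonical node.
        self.entities[slug] += 1

    def touch_relation(self, source: str, kind: str, target: str) -> None:
        # Edges are directed and typed, e.g. ("falcon-9", "built-by", "spacex").
        self.relations[(source, kind, target)] += 1

    def add_event(self, slug: str, kind: str, date_iso: str) -> None:
        # Stored as (date, kind, slug) so sorting orders by date first.
        self.events.append((date_iso, kind, slug))

    def recent_events(self, slug: str, limit: int = 5) -> list:
        """Return the entity's events, newest first, up to `limit`."""
        matching = [event for event in self.events if event[2] == slug]
        return sorted(matching, reverse=True)[:limit]
```

Because the keys are canonical slugs, two articles that mention the same company under different spellings increment one counter rather than creating two nodes.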

Stack

Ingestion: Python 3.12 · feedparser · httpx · BS4
LLM: OpenAI gpt-4o-mini (temp 0.2, JSON mode)
Images: Brave Image Search API (ranked, filtered)
Database: Supabase PostgreSQL · GIN + FTS indexes
Frontend: Next.js 14 App Router · Tailwind CSS 3
Hosting: Vercel · GitHub Actions (4h cron)