An API over the working area, shaped like Bible concerns

The hackathon shipped a client and a server. The dashboard got the attention, but the server is the real interest. A dashboard is one consumer. The API is the thing many tools can build on — anyone scaffolding a new project from WACS, grouping projects by language, or asking a domain question about the working area rather than the published set.

That is the gap. The public data API is already Bible-shaped, and it does carry unpublished content — but getting at it is the problem. You query over GraphQL (obtuse outside the explorer), you often have to fetch the zip yourself to get the actual files, and you have to hope it’s current: updates ride the rendering event bus, so there’s latency, and a dropped webhook leaves it silently stale. We want a Bible-shaped API over the working area — fresh, at any commit, no zip round-trip.

A second consumer, concretely

The scripture editor prototype already needs these exact questions and has nowhere good to ask them. To list a user’s editable projects it hits Gitea’s REST /repos/search, paginates the whole instance, then filters client-side by canWrite and by topic (which repos are editor-compatible) — with a code comment that without uid-scoping the list “reads as empty even when they own writable repos.” That’s Gitea’s single-facet search limit in production. For the Bible-domain facets it leans on the public data API, and pays the ergonomics-and-latency tax above.

One co-located API answers most of that directly — editor-topic repos in language X, owned by this user — fresh, at the current head. The honest gap is canWrite: write permission is Gitea auth state, not domain metadata, so this API can’t cheaply own it. But it can shrink the problem — narrow by topic, language, and owner first, then the client checks write-access on a handful of repos instead of paginating the whole instance. The aspiration is a faster path to canWrite; whether one exists here is open.
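A minimal sketch of that narrowing step, assuming a hypothetical RepoRecord row shape and a caller-supplied can_write callable (neither comes from the prototype). It only illustrates the order of operations: cheap facet filters first, the expensive write check last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoRecord:
    # Hypothetical row shape; field names are illustrative only.
    full_name: str
    owner: str
    language: str      # IETF language code
    topics: frozenset  # topic strings attached to the repo


def narrow_candidates(repos, *, owner, language, topic):
    """Keep only repos that match all three facets."""
    return [
        r for r in repos
        if r.owner == owner and r.language == language and topic in r.topics
    ]


def writable(repos, can_write):
    """Run the write-access check on the already-narrowed list only.

    `can_write` is supplied by the caller (for example, one permission
    lookup per repo against the git host); this service does not own it.
    """
    return [r for r in repos if can_write(r.full_name)]
```

With this shape, the number of write checks is bounded by the size of the narrowed list rather than by the size of the instance.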

Figure. Today: each consumer stitches Gitea search (single-facet) and the public data API (GraphQL + bus latency). Tomorrow: one fresh, Bible-shaped API over the bare repos.

The core move: read git directly, co-located

Run the server on the same box as the git host and read the bare repos straight off disk. Gitea stores bare repos — there is no working tree to disturb — so this is pure reads against the object store, never a checkout.

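Just shell out to git. git cat-file --batch streams hundreds of blobs through one long-lived process; git ls-tree -r <ref> lists a tree at a ref. It’s exactly what Gitea does on that box, so packs and alternates work for free. One caveat: LFS-tracked files come back from cat-file as small pointer files, so resolving their content would take a separate read from Gitea’s LFS store. No FFI, no second git implementation.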
Read at any ref, never mutate. show, cat-file, ls-tree, and archive all read the object DB and touch nothing. Commands that write a working tree, such as git checkout, are never called, and a bare repo has no working tree for them to write to.
Concurrent reads are fine: git’s object store is built for many simultaneous readers; it’s how Gitea serves every clone and web view. The only footnote: a gc/repack running mid-read can, rarely, make a read fail, and a retry covers it.
Zero extra content storage. v1 duplicated working trees into a cache. This reads bytes Gitea already allocated. The DB holds only derived metadata.
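The plumbing in the list above can be sketched as follows. This is a minimal illustration, not the hackathon server's code: BlobReader and read_batch_entry are hypothetical names, and the sketch assumes paths without embedded newlines. It relies on the documented git cat-file --batch response format: a header line of object id, type, and size, then that many bytes of content, then one newline.

```python
import subprocess
from typing import BinaryIO, Optional


def read_batch_entry(stream: BinaryIO) -> Optional[bytes]:
    """Read one `git cat-file --batch` response from `stream`.

    A found object is a header line `<oid> <type> <size>`, then `<size>`
    bytes of content, then one trailing newline. A missing object is
    reported as `<name> missing` with no body; None is returned for it.
    """
    header = stream.readline().rstrip(b"\n")
    if not header:
        raise EOFError("cat-file stream closed")
    if header.endswith(b" missing"):
        return None
    _oid, _kind, size = header.rsplit(b" ", 2)
    body = stream.read(int(size))
    stream.read(1)  # consume the newline that follows the content
    return body


class BlobReader:
    """One long-lived `git cat-file --batch` process over a bare repo."""

    def __init__(self, git_dir: str):
        self._proc = subprocess.Popen(
            ["git", "--git-dir", git_dir, "cat-file", "--batch"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def read(self, ref: str, path: str) -> Optional[bytes]:
        # `<ref>:<path>` names the blob at that path in that commit's tree.
        self._proc.stdin.write(f"{ref}:{path}\n".encode())
        self._proc.stdin.flush()
        return read_batch_entry(self._proc.stdout)

    def close(self) -> None:
        self._proc.stdin.close()
        self._proc.wait()
```

Writing one request and reading one response at a time keeps the pipe from filling up; a production reader would likely add timeouts and the retry mentioned above.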

The metadata store can live elsewhere: outbound cloud Postgres (like the public data API) or local SQLite. That choice is deferred to the open questions below; it changes deployment, not the design.

Freshness by soft polling

A sweep asks for the head SHA per repo — from the git API, or just the refs on disk since we’re local — diffs against the last SHA we processed, and reprocesses only what moved.
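A minimal sketch of one sweep, assuming the refs-on-disk option and hypothetical function names. head_sha resolves HEAD with git rev-parse; repos_to_reprocess is the diff against the last processed SHAs.

```python
import subprocess
from typing import Dict, List, Optional


def head_sha(git_dir: str) -> Optional[str]:
    """Resolve HEAD in a bare repo; None when it cannot be resolved,
    for example in an empty repo with no commits yet."""
    result = subprocess.run(
        ["git", "--git-dir", git_dir, "rev-parse", "--verify", "--quiet", "HEAD"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def repos_to_reprocess(
    last_seen: Dict[str, str], current: Dict[str, Optional[str]]
) -> List[str]:
    """Return repos whose head moved, or first appeared, since the last sweep."""
    return sorted(
        repo
        for repo, sha in current.items()
        if sha is not None and last_seen.get(repo) != sha
    )
```

Because each sweep compares against stored state rather than waiting for an event, a missed notification costs at most one polling interval.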

The payoff: fresh, content-level QA

With usfm-onion in process and full file access, every head commit gets validated as it changes — USFM structure, lint, and dumb-but-critical checks like merge-conflict markers.
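The conflict-marker check is simple enough to sketch directly. The function below is an illustration under stated assumptions, not usfm-onion's API: it reports only complete blocks (an opening marker, a separator, a closing marker), so a half-resolved file with a single leftover marker would need a stricter rule.

```python
from typing import List, Tuple


def conflict_blocks(text: str) -> List[Tuple[int, int]]:
    """Return 1-based (start_line, end_line) pairs, one per conflict block.

    A block is a line starting with seven `<`, later a line that is exactly
    seven `=`, later a line starting with seven `>`. Requiring the full
    sequence avoids flagging a lone `=======` rule in ordinary text.
    """
    blocks = []
    start = None         # line number of the open `<<<<<<<`, if any
    seen_middle = False  # whether the `=======` separator has been seen
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("<<<<<<<"):
            start, seen_middle = number, False
        elif start is not None and line.rstrip() == "=======":
            seen_middle = True
        elif start is not None and seen_middle and line.startswith(">>>>>>>"):
            blocks.append((start, number))
            start, seen_middle = None, False
    return blocks
```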

One Finding type, one view for all checks

The QA above only pays off if every checker reports in the same shape, so one view can query and render across all of them. That shape already exists: the scripture editor’s source-agnostic Finding. At heart it is one idea: point at a span of text and say something about it. Crucially, it is pure data, with no message strings and no behavior. The localized message and any fix action attach later, at the UI edge, so a pipeline can produce a Finding and a locale switch can’t make it stale.

The spine:

Identity — a deterministic id from canonical fields (not the message, which churns), so consumers diff findings by key instead of repainting.
What’s said — source (onion / sous-chef / local-lint / a future analyzer) + code + severity + category (structure vs content).
Where — an anchor. The editor uses two: a live token-id, or a content range (sid, Utf16Span). A server over repos has no live token tree, so its natural address is the verse — book:chapter:verse (the sid) — plus an optional character span when a check can be precise (UTF-8 bytes in Rust, UTF-16 at the JS boundary).
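The spine can be sketched as a small record type. Everything here is illustrative: the field names, the example sid format, and the choice of which fields count as canonical for the id are assumptions, and the editor's real Finding type may differ.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Finding:
    # Field names are illustrative; the editor's real type may differ.
    source: str    # e.g. "onion", "sous-chef", "local-lint"
    code: str      # machine-readable code, never a message string
    severity: str  # e.g. "error" or "warning"
    category: str  # "structure" or "content"
    sid: str       # verse address, e.g. "GEN 1:1"
    span: Optional[Tuple[int, int]] = None  # optional character range

    @property
    def id(self) -> str:
        """Deterministic key hashed from the canonical fields only.

        Assumption: source, code, sid, and span identify a finding;
        severity and category describe it without changing its identity.
        """
        canonical = json.dumps(
            [self.source, self.code, self.sid, self.span],
            separators=(",", ":"),
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Since no message text feeds the hash, rewording or localizing a message leaves the id unchanged, which is what lets consumers diff by key.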

Producers plug in by source: usfm-onion (structure), scripture-sous-chef (content / lexical), and — same shape — a future AI marking phrases that hide NT linguistic features. None know about the others.

Stored keyed by repo + head SHA, the findings are one queryable set: the dashboard reports across all checks in a single view, the editor overlays the very same findings inline — no per-tool adapter, no second book picker divorced from the text.
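A minimal sketch of such a store, using local SQLite purely for illustration (the DB choice is still open). The table layout, column names, and the policy of keeping only the latest head's findings per repo are all assumptions; finding_id stands for the deterministic id described above.

```python
import sqlite3

# Hypothetical schema: one row per finding, keyed by repo + head SHA + id.
SCHEMA = """
CREATE TABLE findings (
    repo       TEXT NOT NULL,
    head_sha   TEXT NOT NULL,
    finding_id TEXT NOT NULL,
    source     TEXT NOT NULL,
    code       TEXT NOT NULL,
    severity   TEXT NOT NULL,
    sid        TEXT NOT NULL,
    PRIMARY KEY (repo, head_sha, finding_id)
)
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def replace_findings(conn, repo, head_sha, rows):
    """Swap in the findings for one repo at its new head, atomically.

    Each row is (finding_id, source, code, severity, sid). Older heads'
    findings for the repo are dropped (an assumption of this sketch).
    """
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM findings WHERE repo = ?", (repo,))
        conn.executemany(
            "INSERT INTO findings VALUES (?, ?, ?, ?, ?, ?, ?)",
            [(repo, head_sha, *row) for row in rows],
        )


def counts_by_source(conn, repo, head_sha):
    """One query across every checker: finding counts per source."""
    cur = conn.execute(
        "SELECT source, COUNT(*) FROM findings "
        "WHERE repo = ? AND head_sha = ? GROUP BY source ORDER BY source",
        (repo, head_sha),
    )
    return cur.fetchall()
```

The same table serves both consumers: the dashboard aggregates across sources, and the editor filters by sid to overlay findings inline.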

Figure. Any checker emits the same source-agnostic Finding: a located span plus what's said about it, pure data. Stored per head SHA; the localized message attaches at each view.

Scope: the git host, and nothing it shouldn’t own

The server treats the git host as its only source of truth. It does not own publication — show_on_biel, PORT, primary status. Those belong to the publishing model in the publishing proposal. The one outbound read worth allowing is the public data API for language normalization — names, IETF codes, gateway relationships. We consult that; we don’t reabsorb it.

This is not a fork

v1’s principle was “external, consume the public API.” Reading bare repos off disk changes the mechanism, not the principle: we still never modify the git host, own its auth, or cut its releases — we only read its storage. The one new requirement is living on (or mounting) the git host’s box. Worth stating plainly so it isn’t heard as the fork-Gitea path others took. It isn’t.

How this differs from the public data API

Complementary systems, different jobs. The case is the separation, not replacement.

	Public data API (BIEL / explorer)	This server
Shape	Bible-domain	Bible-domain
Ergonomics	GraphQL — strong in the explorer, obtuse outside it; fetch the zip yourself for files	Plain queries; content served directly, no zip round-trip
Freshness	Rendering-bus latency; a dropped webhook leaves it silently stale	Fresh — soft-poll on head SHA, self-healing
Version	The rendered version	Any SHA, with the current head always available
Source of truth	Pushed-in rows + render metadata	The bare git repos, read directly
Coverage	Has unpublished content, but it is uneven to reach	Every public repo, including broken ones, at its head
Validity	No content-level concept of valid	Lint, USFM structure, conflict markers, per head
Owns publication?	Yes — its job	No — defers to the publishing model

Tradeoffs, honestly

Must live near the git storage. We trade the portability of an external API client for the locality that makes everything cheap.
The first full pass is heavy (~25k repos), then incremental by SHA diff. Local disk reads make even the cold crawl far cheaper than over the wire.
It doesn’t replicate the language reference data the public data API curates — hence the one outbound read above.

Open questions

DB locality — decide later. Local SQLite (simplest, couples to the box) vs. cloud Postgres (independent, durable, network hop on writes).
Freshness source. Git API for head SHAs, or refs straight off disk?
How far does QA go? Lint and conflict markers are clear wins. Proportion-vs-source, coverage, lexical analysis — where does scope stop?
canWrite — can it be fast here? Write permission is Gitea auth, not public domain data, and the project is public-scoped. Narrowing candidates by facet first helps, but is there a faster path to the write-check itself without pulling Gitea’s auth model into this service? Open.