NSC Labs
Inference Gateway

Heptaconn

Deterministic inference gateway for LLM servers (e.g., llama.cpp). Bounded inflight calls, receipts, and explicit backpressure for local inference compute.

Implementation: Go. Built for self-hosted inference environments.

Bounded inflight · Receipts · Backpressure · Explicit failures

Overview

Heptaconn sits between clients and inference backends. It governs inference compute (not just HTTP): deterministic admission, bounded concurrency, and receipts describing outcomes (accepted, queued, throttled, timed out).

Problem

Direct-to-backend inference under load often leads to unbounded concurrency, retry amplification, timeouts, and inconsistent accounting.

Goal

Keep inference compute stable. When things fail, fail clearly (overload vs malformed vs backend failure) and keep records you can aggregate.

Fit

Works well with telemetry sinks: receipts + call records become an audit trail instead of a pile of logs.

Core mechanics

Bounded inflight

A hard cap on concurrent inference work. Queueing exists, but it is intentional and measurable.

  • max inflight
  • queue policy (explicit)
  • timeouts and cancellation semantics

Receipts

Each request has a lifecycle record: accepted → running → completed/failed/cancelled.

  • request_id
  • timestamps and durations
  • exit status + error taxonomy

Stability controls

Backpressure is a first-class response, not an accident.

  • overload responses are early
  • retry hints are explicit
  • limits are enforced, not suggested

If a client cannot determine the outcome from the receipt, the interface is not specific enough.

Quick demo

Heptamini ships two binaries:

  • heptaconnd — gateway daemon (server)
  • heptaconnc — client for sending test requests

Terminal 1

Start the gateway daemon.

heptaconnd --listen 0.0.0.0:7777

Terminal 2

Send a small payload through the gateway.

printf "hi" > /tmp/p.bin
heptaconnc --addr 127.0.0.1:7777 --file /tmp/p.bin --chunk 8

What you should see

A deterministic receipt: accepted/queued/throttled plus timing. Under load, inflight stays bounded and backpressure becomes explicit.

Flags and defaults may change. The intent is to keep limits explicit and receipts versioned.

Web demo

Heptaconn can be driven from a browser console while inference stays on your machine. The browser talks to the gateway; a local Heptamini node connects outbound and forwards requests to your local backend (e.g., llama.cpp).

Flow
browser → heptaconn (cloud) → heptamini (local) → llama.cpp (local)
                                    ↑
                               outbound link

Heptaconn does not host a model in this setup.

How to try
  1. Open /demo/heptaconn-web.html and generate a pairing code.
  2. Run Heptamini locally and connect using the pairing code.
  3. Send requests from the web console. Receipts are shown for each request.

Demo limits may apply (short sessions, bounded inflight).

Output
  • receipt (always): accepted/queued/throttled + timing
  • optional response preview: truncated (e.g., 256 chars)

Use receipts to verify backpressure and stability under load.

Installation

Run in front of backends

Point clients to Heptaconn; configure backends and concurrency limits.

# Typical shape (conceptual)
# client → heptaconn → llama.cpp (or other local backend)

# Configure:
# - backend targets
# - max inflight
# - timeouts
# - receipt format/version

Binary

Heptamini is the free binary distribution used for evaluation and testing.

Download binaries (Heptamini)

Pricing

Heptaconn is distributed privately while the feature boundary stabilizes. Pricing is a placeholder.

Evaluation (Heptamini)

Public binaries for testing the gateway and client tools.

Pro (placeholder)

Multi-tenant limits, richer routing, extended receipts.

Enterprise (placeholder)

Support contract and custom deployment constraints.

Contact

Questions: nsc@newssourcecrawler.com
