1 Concurrency
The chapter 1 Event Bus described what the bus guarantees (sequential delivery, within-event concurrency) but not how those guarantees are achievable when some work takes hundreds of milliseconds and the audio device invokes callbacks on a thread the application does not own.
Vocalance has three hard timing constraints that cannot all be satisfied on a single naive thread:
- The microphone delivers audio every ~30 ms. A late read drops frames.
- The UI must repaint at ~60 fps, meaning no single task can take more than ~16 ms between repaints.
- Speech recognition takes 100–300 ms. LLM generation takes seconds.
This chapter explains, from first principles, the two concurrency tools Python provides, why Vocalance chooses one as the default and uses the other selectively, and how the two communicate safely.
1.1 The Two Concurrency Models
1.1.1 asyncio: One Thread, Many Tasks
asyncio is a cooperative concurrency system. It runs entirely on a single OS
thread. On that thread lives an event loop that maintains a list of
coroutines — functions defined with async def that can be suspended and
resumed. The loop picks one coroutine, runs it until it encounters an await,
suspends it, picks another, and so on.
The critical property is where suspensions happen: only at await (including
the awaits implied by async for and async with). A coroutine that never
awaits runs to completion without interruption. Nothing else on the loop can
execute while it runs. This makes it safe to read and write shared state
between two await points without any locking: no other coroutine can observe
partial state during that window.
The flip side: a coroutine that never yields blocks everything else. A tight loop that runs for 300 ms freezes the event loop — and everything that depends on it, including the UI — for the entire 300 ms. asyncio gives safety and simplicity for anything that spends most of its time waiting; it is harmful for anything that spends most of its time computing.
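The scheduling rule can be seen in a minimal sketch. All names here are illustrative and not taken from the Vocalance codebase. The two worker coroutines yield at every step with await asyncio.sleep(0), while hog never awaits and therefore runs its whole loop in one uninterrupted stretch.

```python
import asyncio


async def worker(name: str, log: list[str], steps: int) -> None:
    for i in range(steps):
        log.append(f"{name}{i}")  # runs uninterrupted up to the next await
        await asyncio.sleep(0)    # suspension point: the loop may switch tasks here


async def hog(log: list[str]) -> None:
    # No await in the body: once started, nothing else runs until it returns.
    for i in range(3):
        log.append(f"hog{i}")


async def main() -> list[str]:
    log: list[str] = []
    await asyncio.gather(worker("a", log, 2), worker("b", log, 2), hog(log))
    return log


log = asyncio.run(main())
hog_positions = [i for i, entry in enumerate(log) if entry.startswith("hog")]
print(log)
```

The entries written by hog always sit next to each other in the log, whereas the entries of the two workers are free to interleave. Replace the loop body of hog with 300 ms of computation and every other task, including UI repaints, waits for it.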
1.1.2 Threads: Multiple Threads, Preemptive Switching
Threads are the OS’s mechanism for running multiple execution flows. Unlike asyncio, the OS scheduler decides when to switch between threads and does so without any cooperation from the running code — a thread can be preempted mid-statement.
The advantage is that multiple threads can genuinely run in parallel on separate CPU cores. Python’s Global Interpreter Lock (GIL) limits this: the GIL allows only one thread to execute Python bytecode at a time. However, the GIL is released for blocking system calls (file I/O, network I/O) and for most C-extension calls. Every heavy model used in Vocalance — Vosk, YAMNet, Moonshine, and the LLM — is a C extension that releases the GIL during inference. Threads running those models run in true parallel with the main Python thread.
The cost is that preemptive switching makes shared mutable state dangerous. Two threads reading and writing the same variable without coordination can produce results that depend on thread scheduling. Synchronization primitives (locks, queues, futures) are needed wherever threads share state.
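As a small illustration of why coordination is needed, the sketch below has four threads increment one shared counter. The increment is a read-modify-write sequence, so a thread switch in the middle could lose updates; the lock makes each increment indivisible. The Counter class is hypothetical and exists only for this example.

```python
import threading


class Counter:
    """Hypothetical shared state, used only for this illustration."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and both write back old + 1, losing one increment.
        with self._lock:
            self.value += 1


def add_many(counter: Counter, n: int) -> None:
    for _ in range(n):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=add_many, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)
```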
1.2 The Vocalance Concurrency Model
1.2.1 The Default: asyncio on One Thread
Almost everything in Vocalance runs on a single OS thread shared by two cooperating event loops: the Qt event loop (which drives the UI) and the asyncio event loop (which drives all services and the bus). The integration is one call:
QtAsyncio.run(start_app())
QtAsyncio (from PySide6) installs an asyncio event loop implementation that
schedules its turns through Qt’s existing event loop. The two loops interleave
on one thread: Qt paints a frame, yields to asyncio, asyncio runs a few
coroutines, yields back to Qt, and so on.
    flowchart LR
        Thread[Single OS Thread]
        Thread --> Qt[Qt event loop<br/>paints, signals]
        Thread --> Aio[asyncio event loop<br/>tasks, awaits]
        Qt -. yields to .- Aio
        Aio -. yields to .- Qt
Every bus dispatch, every service handler, every Qt signal, every coroutine suspension happens on this one thread, in a single total order. This is what gives the bus’s sequential-delivery guarantee its meaning: only one thing can execute at a time.
1.2.2 The Exceptions: Two Categories of Off-Thread Work
Two categories of work cannot run on the main thread without violating the timing constraints.
Category 1: A foreign thread the application does not own. The PortAudio audio driver invokes its callback on a thread it manages. The application cannot avoid this; it is how the audio API works. The callback must copy the audio buffer and return in microseconds, or the driver’s internal buffer overflows and frames are dropped.
Category 2: CPU-heavy synchronous work. Speech recognition, sound embedding, LLM generation, and OS input calls are all blocking operations that take far longer than a UI frame. Running them on the main thread would freeze the loop.
Both categories require getting work off the main thread and getting results back to it.
1.3 The CPU-Heavy Jobs
The following table lists every operation in the application that is too slow to run on the main thread and must be dispatched to a background thread.
| Job | Where in the code | Cost | Frequency |
|---|---|---|---|
| Vosk command recognition | | 100–300 ms (C++ Kaldi decoder) | Once per spoken command clip |
| YAMNet sound embedding | | 50–100 ms (TensorFlow CPU) | Once per sound clip |
| LLM token generation | | Seconds total, token-by-token streaming | Once per Smart / Amend session |
| | | 5–50 ms per call (OS-blocking) | Per executed command |
| Storage I/O | | Low single-digit ms | Per JSON read or write |
| Model loading (all four) | Service | Seconds to tens of seconds | Once at startup |
One notable absence from the table is Moonshine streaming inference.
MoonshineStreamSession.add_audio_pcm16 is called on the main thread but is
non-blocking: it only enqueues the PCM bytes onto a bounded thread-safe queue.
A dedicated worker thread (moonshine-feeder) owned by the session drains
the queue in batches and runs the native add_audio_to_stream and
update_transcription calls. The decoder cost of a refresh on a long live
segment can scale super-linearly with the segment duration, so isolating it
from the event loop is essential — the main thread is never blocked by
inference even when a refresh takes hundreds of milliseconds on weaker
hardware.
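The enqueue-and-drain pattern described above can be sketched with the standard library alone. This is not the MoonshineStreamSession implementation: the class name, the process_batch callable (a stand-in for the native calls), the queue size, and the policy of dropping a chunk when the queue is full are all assumptions made for illustration.

```python
import queue
import threading


class FeederSketch:
    """Hypothetical stand-in for a session that owns a feeder thread."""

    def __init__(self, process_batch, max_chunks: int = 64) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_chunks)  # bounded
        self._process_batch = process_batch  # stand-in for the native calls
        self._worker = threading.Thread(
            target=self._drain, name="feeder-sketch", daemon=True
        )
        self._worker.start()

    def add_audio(self, pcm: bytes) -> bool:
        """Non-blocking enqueue. Returns False if the queue is full (assumed policy)."""
        try:
            self._queue.put_nowait(pcm)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        self._queue.put(None)  # sentinel: ask the worker to exit
        self._worker.join()

    def _drain(self) -> None:
        while True:
            item = self._queue.get()  # block until at least one item arrives
            batch = []
            while item is not None:
                batch.append(item)
                try:
                    item = self._queue.get_nowait()  # take whatever else is queued
                except queue.Empty:
                    break
            if batch:
                self._process_batch(b"".join(batch))  # the slow part, off the loop
            if item is None:
                return


received: list[bytes] = []
feeder = FeederSketch(received.append)
for _ in range(10):
    feeder.add_audio(b"\x00\x01" * 4)
feeder.close()
```

The caller only ever pays for put_nowait, so its cost stays constant no matter how long a batch takes to process.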
1.4 The Thread Crossing Primitives
1.4.1 Moving Results from a Foreign Thread to the Main Thread
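Almost none of the asyncio API is safe to call from a thread other than the one
running the event loop. The primitive designed for that purpose is
loop.call_soon_threadsafe (asyncio.run_coroutine_threadsafe, which is built on
it, is the other thread-safe entry point):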
loop.call_soon_threadsafe(callable, *args)
When called from any thread, it places callable into a thread-safe internal
queue that the event loop reads on its next tick, and wakes the loop if it is
currently idle. The callable runs on the main thread, in the same single-thread
context as everything else. The foreign thread returns immediately after
scheduling; it does not wait for the callable to execute.
Two wrappers in vocalance/app/lifecycle/worker.py cover the two typical
cases:
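    def schedule_on_loop(loop, coro) -> None:
        # create_task itself runs on the loop thread, not on the calling thread.
        loop.call_soon_threadsafe(loop.create_task, coro)

    def schedule_on_loop_callback(loop, fn, *args) -> None:
        # Schedule a plain synchronous callable on the loop thread.
        loop.call_soon_threadsafe(fn, *args)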
The first wraps a coroutine in a task; the second schedules a synchronous callable. Together they cover every “I am on a foreign thread and need to hand something to the main thread” case in the codebase.
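A short, self-contained sketch shows both wrappers in use. The wrappers are copied from the listing above; the foreign thread, the handlers, and the seen dictionary are hypothetical and exist only to record which thread each piece of code ran on.

```python
import asyncio
import threading


def schedule_on_loop(loop, coro) -> None:
    loop.call_soon_threadsafe(loop.create_task, coro)


def schedule_on_loop_callback(loop, fn, *args) -> None:
    loop.call_soon_threadsafe(fn, *args)


async def main() -> dict:
    loop = asyncio.get_running_loop()
    seen: dict = {"main": threading.get_ident()}
    done = asyncio.Event()

    async def handle(value: int) -> None:
        seen["coro"] = (value, threading.get_ident())
        done.set()

    def on_value(value: int) -> None:
        seen["callback"] = (value, threading.get_ident())

    def foreign() -> None:
        # Runs on another thread; it only schedules and returns immediately.
        seen["foreign"] = threading.get_ident()
        schedule_on_loop_callback(loop, on_value, 1)
        schedule_on_loop(loop, handle(2))

    threading.Thread(target=foreign).start()
    await done.wait()
    return seen


seen = asyncio.run(main())
```

Both handlers report the main thread's identifier even though they were scheduled from the foreign thread.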
1.4.2 The PortAudio Crossing
The capture service is the one place where a foreign thread must hand data to the main thread on every audio frame — roughly thirty times per second.
    flowchart LR
        subgraph Foreign["Driver Thread (PortAudio)"]
            Drv[PortAudio driver] -->|PCM buffer| CB[_portaudio_callback<br/>copy bytes, record timestamp]
        end
        subgraph Main["Main Thread"]
            Pub[_publish_chunk] -->|AudioChunkCapturedEvent| Bus((Event bus))
        end
        CB -.->|call_soon_threadsafe| Pub
The callback’s three-step contract: copy the bytes (the buffer pointer is only
valid for the duration of the callback), record the timestamp, and schedule
_publish_chunk on the main thread. The callback returns in microseconds.
_publish_chunk runs on the main thread, constructs the event, and publishes
to the bus. All subscriber dispatch happens on the main thread with no
threading hazards.
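The three-step contract can be simulated without an audio device. In the sketch below a plain thread plays the role of the driver and overwrites one buffer for every frame, which is why the copy in step 1 matters. The function names, the fake driver, and the list standing in for the bus are assumptions for illustration; the real callback receives its buffer from PortAudio.

```python
import asyncio
import threading
import time


async def main() -> list:
    loop = asyncio.get_running_loop()
    published: list = []
    all_done = asyncio.Event()
    total = 3

    def publish_chunk(pcm: bytes, captured_at: float) -> None:
        # Main thread: a real service would build an event and publish it to the bus.
        published.append((pcm, captured_at))
        if len(published) == total:
            all_done.set()

    def audio_callback(buffer: bytearray) -> None:
        pcm = bytes(buffer)             # 1. copy: the driver reuses the buffer
        captured_at = time.monotonic()  # 2. record the timestamp
        loop.call_soon_threadsafe(publish_chunk, pcm, captured_at)  # 3. schedule

    def fake_driver() -> None:
        buffer = bytearray(4)           # one buffer, overwritten for every frame
        for frame in range(total):
            buffer[:] = bytes([frame]) * 4
            audio_callback(buffer)

    threading.Thread(target=fake_driver, name="fake-driver").start()
    await all_done.wait()
    return published


published = asyncio.run(main())
```

Passing the bytearray itself instead of the copy would let later frames overwrite data that the main thread has not processed yet.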
1.5 run_blocking: Dispatching Heavy Work to a Background Thread
When the main thread needs to run blocking work, the pattern is the inverse of the audio crossing: spawn a thread, let it run the blocking call, and have it hand the result back to the main thread when done.
run_blocking (vocalance/app/lifecycle/worker.py) implements this:
    async def run_blocking(fn, *args, cancel_token=None, name=..., **kwargs) -> T:
        loop = asyncio.get_running_loop()
        future = loop.create_future()

        def worker() -> None:
            # Runs on the daemon thread; hands the outcome back to the loop thread.
            try:
                result = fn(*args, **kwargs)
            except BaseException as exc:
                loop.call_soon_threadsafe(future.set_exception, exc)
            else:
                loop.call_soon_threadsafe(future.set_result, result)

        threading.Thread(target=worker, daemon=True, name=name).start()
        try:
            return await future
        except asyncio.CancelledError:
            # The caller was cancelled: signal the worker, then propagate.
            if cancel_token is not None:
                cancel_token.set()
            raise
The calling coroutine creates a Future — a placeholder that will be resolved
when the result is ready. A daemon thread is spawned to run the blocking
function. When the thread finishes, it calls call_soon_threadsafe to resolve
the future on the main thread. The calling coroutine was suspended on
await future; resolving the future schedules it to resume on the next loop
tick. From the caller’s perspective this looks like a normal await.
    flowchart LR
        Caller[Caller coroutine<br/>main thread] -->|1. create future, spawn thread| W[Daemon thread<br/>runs fn]
        W -->|2. call_soon_threadsafe<br/>resolve future| Loop[Event loop<br/>main thread]
        Loop -->|3. resume caller| Caller
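The round trip can be exercised end to end with a self-contained variant. This sketch is not the code in worker.py: the name run_blocking_sketch, the default thread name, and the resolve helper are assumptions. The helper addresses one edge case: if the caller is cancelled first, the future is already cancelled when the worker finishes, and calling set_result on it would raise InvalidStateError inside the loop callback. Whether the real run_blocking guards against this elsewhere is not visible in the listing.

```python
import asyncio
import threading


async def run_blocking_sketch(fn, *args, cancel_token=None,
                              name="blocking-sketch", **kwargs):
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def resolve(setter, value) -> None:
        # Runs on the loop thread. Skip if the caller was already cancelled.
        if not future.done():
            setter(value)

    def worker() -> None:
        try:
            result = fn(*args, **kwargs)
        except BaseException as exc:
            loop.call_soon_threadsafe(resolve, future.set_exception, exc)
        else:
            loop.call_soon_threadsafe(resolve, future.set_result, result)

    threading.Thread(target=worker, daemon=True, name=name).start()
    try:
        return await future
    except asyncio.CancelledError:
        if cancel_token is not None:
            cancel_token.set()
        raise


def slow_square(x: int):
    return x * x, threading.current_thread().name


def boom():
    raise ValueError("failed in worker")


async def main():
    ok = await run_blocking_sketch(slow_square, 7, name="square-worker")
    try:
        await run_blocking_sketch(boom)
    except ValueError as exc:  # the worker's exception surfaces at the await
        return ok, str(exc)


ok, error = asyncio.run(main())
```

The caller sees an ordinary return value or an ordinary exception; only the thread name in the result reveals that the function ran elsewhere.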
1.5.1 Why Daemon Threads
daemon=True means the interpreter does not wait for these threads to finish
before exiting. If a C-extension call hangs (a deadlock or an infinite loop),
it cannot keep the process alive past shutdown. The graceful path is
cooperative cancellation via CancellationToken (below). Daemon is the safety
net when graceful cancellation does not complete.
1.5.2 Why One Thread Per Call
The alternative to one-per-call is a shared thread pool
(ThreadPoolExecutor). The codebase avoids this for two reasons.
First, pool starvation. If long-running calls such as LLM generation occupy every pool worker for thirty seconds, every other blocking call queues behind them. With one-per-call the application imposes no upper bound on concurrency: Vosk, YAMNet, and a click can all start simultaneously on their own threads.
Second, simplicity of reasoning. Each run_blocking call is independent;
there is no shared pool state to reason about, no worker lifecycle to manage,
and no risk of one call affecting another’s latency.
The cost is slightly higher thread-spawn overhead (~tens of microseconds per call), which is negligible compared to the milliseconds to seconds of work each thread performs.
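Pool starvation is easy to reproduce in miniature. The sketch below uses a pool with a single worker so the effect is deterministic; the job functions are hypothetical, and the long job waits on an event instead of doing real work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

release = threading.Event()


def long_job() -> str:
    release.wait(timeout=5)  # stands in for seconds of LLM generation
    return "long"


def short_job() -> str:
    return "short"


# Shared pool with a single worker: the short job queues behind the long one.
pool = ThreadPoolExecutor(max_workers=1)
long_future = pool.submit(long_job)
short_future = pool.submit(short_job)
done, _ = wait([short_future], timeout=0.2)
starved_in_pool = short_future not in done

# One thread per call: the short job runs immediately on its own thread.
results: list[str] = []
thread = threading.Thread(target=lambda: results.append(short_job()), daemon=True)
thread.start()
thread.join(timeout=1)

release.set()  # let the long job finish so the pool can drain
pool.shutdown(wait=True)
```

A larger pool only moves the threshold: once every worker is busy, the next call waits regardless of how short it is.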
1.5.3 Serializing OS Input
One consequence of one-thread-per-call is that multiple callers can dispatch
pyautogui calls simultaneously. If three spoken commands each spawn a thread
for OS input, the OS receives them in thread-scheduling order — not speech order.
KeyboardInputService
(vocalance/app/services/keyboard_input_service.py) solves this by combining
run_blocking with an asyncio.Lock:
    class KeyboardInputService(Service):
        def __init__(self, event_bus: EventBus) -> None:
            super().__init__(event_bus)
            self._serial = asyncio.Lock()

        async def run(self, fn, *args, **kwargs):
            async with self._serial:
                return await run_blocking(fn, *args, name="vocalance-input", **kwargs)
The lock is acquired by a coroutine on the main thread before the daemon thread
is spawned, and released after the future resolves. At any moment, at most one
pyautogui call is in flight. Because asyncio.Lock is fair (waiters acquire it
in the order they started waiting), the calls execute in the order they were
requested. asyncio.Lock is used rather than threading.Lock because the
contention is between coroutines on the main thread, not between threads; an
asyncio lock is compatible with async with and does not block the event loop
when waiting.
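The ordering effect can be checked with a stripped-down sketch. It is an illustration under assumptions, not the service itself: SerialInputSketch has no bus or base class, asyncio.to_thread stands in for run_blocking, and press simulates a blocking OS input call with time.sleep.

```python
import asyncio
import time


async def run_blocking_stand_in(fn, *args):
    # Stand-in for run_blocking: asyncio.to_thread also runs fn off the loop thread.
    return await asyncio.to_thread(fn, *args)


class SerialInputSketch:
    def __init__(self) -> None:
        self._serial = asyncio.Lock()

    async def run(self, fn, *args):
        async with self._serial:  # waiters are served in arrival order
            return await run_blocking_stand_in(fn, *args)


async def main() -> list[str]:
    order: list[str] = []

    def press(key: str, delay: float) -> None:
        time.sleep(delay)  # stands in for a blocking OS input call
        order.append(key)

    service = SerialInputSketch()
    # Issued in speech order; the first call is deliberately the slowest.
    await asyncio.gather(
        service.run(press, "first", 0.03),
        service.run(press, "second", 0.02),
        service.run(press, "third", 0.01),
    )
    return order


order = asyncio.run(main())
```

Without the lock the three calls would overlap and finish in order of their delays, which is the reverse of the order in which they were issued.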
1.6 Cancellation
During shutdown, background threads must stop cooperatively. The mechanism is
CancellationToken (vocalance/app/lifecycle/cancellation.py), an object
that wraps both an asyncio.Event and a threading.Event and keeps them
in sync.
    flowchart LR
        subgraph CT["CancellationToken"]
            AE[asyncio.Event]
            TE[threading.Event]
        end
        Aio[Awaiting coroutine] -->|await token.wait| AE
        Worker[Daemon worker] -->|token.is_set or<br/>threading_event().wait| TE
        Set[token.set] --> AE
        Set --> TE
Setting the token from any thread sets both internal events.
Coroutines waiting on await token.wait() unblock on the next loop tick.
Daemon workers polling token.is_set() between iterations see the change on
their next poll. Workers that need to pause between iterations can use
token.threading_event().wait(timeout), which returns as soon as the token is
set or after at most timeout seconds, whichever comes first.
run_blocking propagates cancellation automatically: if the awaiting coroutine
is cancelled while the daemon thread is still running and a cancel_token was
passed, run_blocking calls cancel_token.set() so the daemon thread can observe
the cancellation and return early.
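One way such a token could be built is sketched below. This is an assumption-laden illustration rather than the contents of cancellation.py: the class name, the constructor taking the loop, and the method bodies are hypothetical, and only the method names set, is_set, wait, and threading_event come from the text above. Because asyncio.Event is not thread-safe, the sketch sets it through call_soon_threadsafe.

```python
import asyncio
import threading


class CancellationTokenSketch:
    """Illustrative only: pairs a threading.Event with an asyncio.Event."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self._thread_event = threading.Event()
        self._async_event = asyncio.Event()

    def set(self) -> None:
        self._thread_event.set()  # visible to worker threads immediately
        # asyncio.Event is not thread-safe, so set it on the loop thread.
        self._loop.call_soon_threadsafe(self._async_event.set)

    def is_set(self) -> bool:
        return self._thread_event.is_set()

    def threading_event(self) -> threading.Event:
        return self._thread_event

    async def wait(self) -> None:
        await self._async_event.wait()


async def main() -> dict:
    token = CancellationTokenSketch(asyncio.get_running_loop())
    state = {"iterations": 0, "stopped": False}

    def worker() -> None:
        while not token.is_set():
            state["iterations"] += 1
            token.threading_event().wait(0.01)  # pause, but wake early on cancel
        state["stopped"] = True

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    await asyncio.sleep(0.05)
    token.set()
    await token.wait()                         # the coroutine side unblocks too
    await asyncio.to_thread(thread.join, 1.0)  # join without blocking the loop
    return state


state = asyncio.run(main())
```

Routing the asyncio side through the loop is also consistent with coroutines unblocking on the next loop tick rather than at the instant set is called.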
1.7 Summary: Three Rules
Every concurrency decision in the codebase follows three rules:
1. Default: asyncio on one thread. Services, handlers, UI controllers, and the bus all run on the asyncio loop, sharing one OS thread with Qt.
2. Foreign thread → main thread via call_soon_threadsafe. The audio driver's callback runs on the only foreign thread. It copies, schedules, and returns in microseconds.
3. Heavy blocking work → daemon thread via run_blocking. Model inference, LLM generation, and OS input calls run on per-call daemon threads. Results return to the main thread through a resolved future.
With the concurrency model established, the next chapter — 1 Lifecycle — explains how all of these services, tasks, and threads are constructed, started, and torn down in the correct order.