Two developer write-ups describe using ONNX Runtime on a CPU to split a short two-speaker, 14.171-second conversation into smaller audio segments. The first approach uses Silero VAD (voice activity detection) to keep only the periods classified as speech. The source MP3 is decoded to a 16 kHz mono waveform, processed in 32 ms chunks (512 samples each), and run with ONNX Runtime's CPUExecutionProvider. With a start threshold of 0.5, an end threshold of 0.35, a minimum silence of 100 ms, a minimum speech duration of 250 ms, and 30 ms of padding, the detector produces 12 speech intervals, each saved as a separate WAV file. Together the speech segments total 11.917 seconds of audio, roughly 84% of the clip.
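To make the role of those five parameters concrete, here is a minimal sketch of how per-chunk speech probabilities could be turned into intervals with two thresholds. It is not Silero's own post-processing code; the function name `probs_to_intervals` and its exact rules are assumptions made for illustration, and only the parameter values come from the write-up.

```python
from typing import List, Optional, Tuple

CHUNK_S = 0.032  # 512 samples at 16 kHz


def probs_to_intervals(
    probs: List[float],
    start_thr: float = 0.5,
    end_thr: float = 0.35,
    min_silence_s: float = 0.100,
    min_speech_s: float = 0.250,
    pad_s: float = 0.030,
    chunk_s: float = CHUNK_S,
) -> List[Tuple[float, float]]:
    """Hypothetical helper: one speech probability per chunk in, (start, end) seconds out."""
    total_s = len(probs) * chunk_s
    raw: List[Tuple[float, float]] = []
    in_speech = False
    seg_start = 0.0
    silence_start: Optional[float] = None  # where the current low-probability run began

    for i, p in enumerate(probs):
        t = i * chunk_s
        if not in_speech:
            # Speech opens only when the higher (start) threshold is reached.
            if p >= start_thr:
                in_speech, seg_start, silence_start = True, t, None
        elif p < end_thr:
            # Below the lower (end) threshold: count how long the silence lasts.
            if silence_start is None:
                silence_start = t
            if t + chunk_s - silence_start >= min_silence_s:
                raw.append((seg_start, silence_start))
                in_speech, silence_start = False, None
        else:
            # Anything at or above the end threshold keeps the segment open.
            silence_start = None

    if in_speech:
        raw.append((seg_start, silence_start if silence_start is not None else total_s))

    # Drop segments that are too short, then pad the rest, clamped to the clip.
    return [
        (max(0.0, s - pad_s), min(total_s, e + pad_s))
        for s, e in raw
        if e - s >= min_speech_s
    ]
```

Fed the probabilities `[0.1]*5 + [0.9]*10 + [0.4]*2 + [0.1]*5 + [0.9]*3 + [0.1]*5`, this sketch returns a single interval from about 0.13 s to about 0.574 s: the two chunks at 0.4 sit between the thresholds and keep the first segment open, while the three-chunk burst near the end is discarded as shorter than 250 ms.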

The second approach uses Pyannote Segmentation 3.0 in ONNX form to detect speaker changes rather than general speech presence. The model runs on 10-second windows of 16 kHz input. Each frame is labeled from the model's speaker-activity probabilities, using a 0.5 activity threshold and an overlap-margin rule. Post-processing then merges short states and assigns silent frames to adjacent speakers. The result is 6 utterance-level WAV segments that cover the full input without gaps or overlaps. Both tests report real-time factors well below 1x for inference and segmentation on a Mac Studio, meaning processing takes less time than the audio lasts. Both also note that the speaker labels are not guaranteed to correspond to consistent identities across longer recordings.
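The post-processing step can be sketched as follows, starting from one label per frame. This is an illustrative reading of the description rather than the write-up's actual code: the helper names, the `SILENCE` marker, the `min_frames` value, and the choice to hand silence to the preceding speaker (or to the first speaker for leading silence) are all assumptions. The overlap-margin rule is not specified in the summary and is left out.

```python
from itertools import groupby
from typing import List, Tuple

SILENCE = -1  # hypothetical marker for frames with no active speaker

Run = Tuple[int, int, int]  # (label, start_frame, end_frame_exclusive)


def run_length(labels: List[int]) -> List[Run]:
    """Collapse consecutive identical labels into runs."""
    runs, pos = [], 0
    for label, group in groupby(labels):
        n = len(list(group))
        runs.append((label, pos, pos + n))
        pos += n
    return runs


def relabel_short_runs(labels: List[int], min_frames: int) -> List[int]:
    """Give each run shorter than min_frames the label of the frame before it.

    A short run at the very start has no predecessor and is left unchanged.
    """
    out = list(labels)
    for _, start, end in run_length(labels):
        if end - start < min_frames and start > 0:
            out[start:end] = [out[start - 1]] * (end - start)
    return out


def fill_silence(labels: List[int]) -> List[int]:
    """Assign silent frames to the preceding speaker; leading silence goes to the first speaker."""
    out = list(labels)
    last = None
    for i, label in enumerate(out):
        if label == SILENCE:
            if last is not None:
                out[i] = last
        else:
            last = label
    first = next((label for label in out if label != SILENCE), None)
    if first is not None:
        i = 0
        while out[i] == SILENCE:
            out[i] = first
            i += 1
    return out


def segment(labels: List[int], min_frames: int = 3) -> List[Run]:
    """Merge short states, absorb silence, and return gap-free speaker runs."""
    return run_length(fill_silence(relabel_short_runs(labels, min_frames)))
```

For the 19 frames `[-1, -1, 0, 0, 0, 0, 1, 0, 0, 0, -1, -1, -1, -1, 1, 1, 1, 1, -1]`, `segment` returns `[(0, 0, 14), (1, 14, 19)]`: the one-frame blip of speaker 1 is merged into speaker 0, the silences are absorbed, and the two runs tile the input with no gaps. Multiplying the frame indices by the model's frame duration gives cut points in seconds.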