Yes, you are right, CPUs do suck at timing on those scales, which is why you usually can't work with audio buffers below a certain size without getting underruns; on my system, for example, that's 256 samples without ASIO and about 2 ms with it.
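To put those buffer sizes on the same scale, here's a quick back-of-the-envelope sketch. The sampling rates (44100 and 48000 Hz) are my assumption, since the actual rate isn't stated above, and the function names are made up for illustration.

```python
def buffer_duration_ms(samples, sample_rate):
    """Time covered by one buffer of `samples` sample points, in milliseconds."""
    return 1000.0 * samples / sample_rate


def samples_for_duration(ms, sample_rate):
    """Number of sample points (rounded) that fit into `ms` milliseconds."""
    return round(sample_rate * ms / 1000.0)


if __name__ == "__main__":
    # Assumed rates; swap in whatever your soundcard actually runs at.
    for rate in (44100, 48000):
        print(rate, "Hz:",
              "256 samples =", round(buffer_duration_ms(256, rate), 2), "ms,",
              "2 ms =", samples_for_duration(2, rate), "samples")
```

So at those rates a 256-sample buffer is somewhere between 5 and 6 ms, and a 2 ms buffer is only around 90 to 100 sample points.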
The physical device may/does have its own crystal, but to be honest, that's so low level that I don't really care about it; it's like caring about how high a sampling rate the 1-bit DAC/ADC has. I set the sampling rate in my OS (which probably sets it on the soundcard) and probably in my DAW as well, although the DAW may either do realtime resampling or use ASIO to talk to the soundcard directly, sidestepping any OS setting.
But yeah, overhead is imho high enough to not care too much. If you're using QSources, you can push individual samplepoints into a very small buffer and queue that onto the Source, which in effect sends it to the soundcard; increasing or decreasing the read index by 1 samplepoint (or more, depending on how big a divergence you have / how fast a convergence you want) will fix any drift that may arise.
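A rough sketch of that read-index nudge, written in plain Python rather than against the Löve API so the logic is easy to check. `fill_buffer`, `source` and `correction` are hypothetical names; in Löve, the list would be a SoundData and the returned buffer is what you'd queue onto the QSource.

```python
def fill_buffer(source, read_index, size, correction=0):
    """Copy `size` sample points from `source` into a fresh small buffer.

    `correction` is applied to the read index once, before copying:
    positive values skip sample points (audio catches up),
    negative values repeat sample points (audio falls back).
    Reads past the end of `source` are padded with silence.
    Returns (buffer, next_read_index).
    """
    start = max(0, read_index + correction)
    buf = [source[i] if i < len(source) else 0.0
           for i in range(start, start + size)]
    return buf, start + size


if __name__ == "__main__":
    source = list(range(20))                      # stand-in for decoded audio
    buf, idx = fill_buffer(source, 0, 4)          # no drift, plain copy
    print(buf, idx)
    buf, idx = fill_buffer(source, idx, 4, 1)     # skip one sample point
    print(buf, idx)
    buf, idx = fill_buffer(source, idx, 4, -1)    # repeat one sample point
    print(buf, idx)
```

With a buffer this small, a one-samplepoint skip or repeat per buffer should be inaudible in most material, but that's something to test by ear.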
Again, my biggest issue is/was that I wasn't sure how Video objects in Löve worked; now that I've looked at it, a measly Video:play and Video:tell isn't the most accurate thing you could have. That said, you should test which works better for you: seeking the video based on the very precise audio position you can achieve with QSources and a SoundData buffer, or messing with the audio decoding, skipping or duplicating samplepoints based on the info you get from Video:tell.
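For the second option, the decision of how many samplepoints to skip or duplicate could look something like this. Again plain Python and purely a sketch: the function names, the sign convention, the `max_step` clamp and the `dead_zone` are my own assumptions, and `video_seconds` stands in for whatever Video:tell gives you.

```python
def drift_in_samples(video_seconds, audio_samples_played, sample_rate):
    """Positive result: the audio is behind the video by that many sample points."""
    return round(video_seconds * sample_rate) - audio_samples_played


def correction_step(drift, max_step=1, dead_zone=0):
    """Turn a measured drift into a per-buffer correction.

    Drift within `dead_zone` is ignored (the video position is not sample accurate),
    anything larger is clamped to +/- `max_step` so the audio converges gradually.
    Positive means skip sample points, negative means duplicate them.
    """
    if abs(drift) <= dead_zone:
        return 0
    return max(-max_step, min(max_step, drift))


if __name__ == "__main__":
    drift = drift_in_samples(1.0, 44000, 44100)   # audio is 100 sample points behind
    print(drift, correction_step(drift), correction_step(drift, max_step=8))
```

A bigger `max_step` converges faster but is more likely to be audible; the dead zone is there so the noise in the video position doesn't get chased back and forth.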
Again, hopefully no offense given, certainly none taken here; just that the discussion may have diverged a bit.
(Also, this way you wouldn't need to queue up exact audio frames either; you could either load the whole audio into one SoundData (which would take up a large amount of RAM) or just use a Decoder and decode into the small SoundData used as the QSource buffer instead (Decoders have such functionality from 0.11).)
Finally, the main reason I'd rather use QSources than normal Sources with :setPitch is that timing the starts of normal Sources is hard; you'll probably have noticeable audio gaps, which I'm guessing isn't acceptable.