And a really, really, "smart" system of this sort would buffer that first 1/4 or so second, feed it out (as you described) as part of the delayed stream, and then...
... and then "speed up" the next few seconds of output until it catches up with the speaker's "real time". There are numerous ways of taking a voice stream and shortening its running time. Radio stations do this, for example, to cut a 25-minute (say) interview or talk fest down to 20 minutes, letting them pump in more commercials.
For better or worse, Rush Limbaugh was the most vocal, so to speak, opponent of this and he got the stations to stop playing with his broadcasts.
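To put rough numbers on the catch-up idea, here is a minimal Python sketch of the arithmetic only (not of the audio processing itself). The function names and the 10% speed-up figure are my own assumptions for illustration. The reasoning: while playing at a rate of s times normal, each second of wall-clock time plays s seconds of audio while only 1 new second arrives, so the lag shrinks by (s - 1) seconds per second.

```python
def catch_up_seconds(delay_s: float, speedup: float) -> float:
    """Wall-clock seconds of sped-up playback needed to drain a lag of delay_s.

    Hypothetical helper: the lag shrinks by (speedup - 1) seconds for every
    second of playback, so draining it takes delay_s / (speedup - 1).
    """
    if speedup <= 1.0:
        raise ValueError("speedup must be greater than 1 to ever catch up")
    return delay_s / (speedup - 1.0)


def required_speedup(original_min: float, target_min: float) -> float:
    """Playback-rate factor needed to fit original_min of audio into target_min."""
    if target_min <= 0:
        raise ValueError("target length must be positive")
    return original_min / target_min


if __name__ == "__main__":
    # A 1/4-second buffer played back 10% fast (assumed rate) takes
    # roughly 2.5 seconds to catch up with real time.
    print(catch_up_seconds(0.25, 1.10))

    # Squeezing a 25-minute program into 20 minutes needs a 1.25x rate.
    print(required_speedup(25, 20))
```

So a quarter-second delay really would be absorbed in "the next few seconds" at a modest speed-up, while the radio example needs a much more aggressive 25% increase, which is presumably why listeners (and hosts) notice it.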