Jasmin Virdi
Posted on May 31

Streaming an LLM response, in 4 GIFs

#ai #webdev #javascript #tutorial

Building TinyAgent (2 Part Series)
1 An LLM API call, in 4 GIFs
2 Streaming an LLM response, in 4 GIFs

We have all watched tokens stream in from an LLM, appearing one at a time as if the model were typing. If you used the Anthropic SDK's .stream() method, it just worked and you probably never saw what was on the wire. This post focuses on how a streamed response works and on the bugs the SDK handles for you under the hood.

1. Why streaming exists

Enabling streaming takes one change to the POST request: a single field, "stream": true. That one field changes how the response feels.

Here is what to take from the GIF.

The left side shows no streaming: the cursor blinks for 4 seconds, then the whole response lands at once.

The right side shows streaming: the first word shows up in about 300 milliseconds, and words flow in as the model generates them.

Both sides use the same model and the same prompt, and take the same total time. The right side simply started showing its response almost 4 seconds earlier.

A four-second wait for a full reply feels broken. A streamed reply that finishes in four seconds feels fast. Streaming doesn't make the model faster; it makes the wait disappear.

2. What's on the wire

When you set stream: true, the API stops sending a single JSON blob. It keeps the HTTP connection open and pushes events down the line as the model generates them. The format is Server-Sent Events (SSE), a web standard, so any SSE debugger can read this stream.

Here's what comes through:

A few things to notice:

The text lives in delta.text, nested inside content_block_delta events. Those are the events to watch for.

stop_reason moved. In post 1, we saw it right there in the response JSON. Here, it arrives at the very end inside a message_delta event, just before message_stop. If the loop bails out as soon as the text
