Solidigm presented (video here) at AIFD8 this month, and as part of their presentation they spent time dissecting what happens to a prompt, how token growth happens, and where storage can help speed up prompt processing.
The token count explosion

It all starts with a simple prompt, something like “run a benchmark against a drive”, maybe 12 tokens, which can balloon into something much larger once it actually gets processed. As an LLM processes the prompt it goes through a number of steps: building context, calling tools, obtaining and interpreting results, persisting knowledge and, finally, responding to the prompt.

Digging a level deeper, here’s what the token counts look like during prompt processing. The first step is to understand the environment of the prompt: rules, safety requirements and the methodology at its disposal. Next comes retrieval activity that gathers the information needed to process the prompt, followed by identifying the tools (and their APIs) needed to carry it out. Once the LLM has all that, it plans out the steps needed to perform the prompt. Tool results are then generated, interpreted and fed back into LLM processing to determine the next step. At some point prompt processing completes and the reply is sent back to the issuer.
As one can see in the above, the prompt itself is minuscule in token count compared to all the activity needed to process it. And this is how just one (albeit complex) ~12 token prompt can grow into a 42K token context.
Inferencing and Time To First Token
Inferencing consists of two phases:
- PreFill phase – which is the processing that goes on to take the context token stream and convert it into a KV (Key:Value) store which the LLM can use for subsequent processing so it doesn’t have to go back to the token context. PreFill ends up with a fully populated KV store representing all the tokens in the current context, and generates the first token in the LLM response to the prompt
- Decode – which is all subsequent processing needed to generate the rest of the prompt response. It uses that KV store to underpin its processing as it generates any further tokens needed to answer the prompt.
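To make the two phases a bit more concrete, here is a minimal Python sketch of my own (not something Solidigm showed, and nothing like a real inference engine). It assumes a toy single-head attention where each token's key and value are just a made-up embedding; `embed`, `attend`, `next_token`, `prefill` and `decode` are all hypothetical names. The point is only the shape of the work: PreFill walks the whole context to fill the KV store and emits the first token, while Decode adds one entry per step and reuses everything already stored.

```python
import math

DIM = 4  # toy vector width (assumption; real models use far larger sizes)

def embed(token_id):
    # Hypothetical deterministic embedding so the sketch needs no model weights.
    return [math.sin(token_id * (i + 1)) for i in range(DIM)]

def attend(query, keys, values):
    # Scaled dot-product attention of one query over every cached key/value.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(DIM)]

def next_token(vector, vocab_size=50):
    # Toy stand-in for sampling: map the attention output to a token id.
    return int(abs(sum(vector)) * 1000) % vocab_size

def prefill(context_tokens):
    # Build a KV entry for EVERY context token, then produce the first token.
    cache = {"keys": [], "values": []}
    for tok in context_tokens:
        vec = embed(tok)
        cache["keys"].append(vec)    # toy model: key = value = embedding
        cache["values"].append(vec)
    first = next_token(attend(cache["keys"][-1], cache["keys"], cache["values"]))
    return cache, first

def decode(cache, last_token, max_new_tokens):
    # Each step appends ONE new KV entry and reuses all the earlier ones.
    generated = []
    for _ in range(max_new_tokens):
        vec = embed(last_token)
        cache["keys"].append(vec)
        cache["values"].append(vec)
        last_token = next_token(attend(vec, cache["keys"], cache["values"]))
        generated.append(last_token)
    return generated

cache, first = prefill(list(range(12)))   # a 12 token "prompt"
rest = decode(cache, first, 5)
print(len(cache["keys"]), "KV entries after PreFill plus 5 decode steps")
```

Even in this toy, the cost of PreFill scales with the size of the context, while each Decode step only pays for one new token plus a read over the existing KV store.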
Solidigm went on to describe how these activities impact the Time To First Token (TTFT), or how long it takes from the time the prompt is issued until the LLM responds with the first word (token) of the prompt response.
(Although Solidigm’s chart shows Decode in the TTFT path, I believe this is incorrect, as PreFill generates the first token. Nonetheless, there is a portion of PreFill that “decodes” the first token of the prompt response, and I assume that’s what they are showing here. Of course, I could be mistaken.)

Storage can impact both the time it takes to assemble context tokens and to perform PreFill.
While storage can matter a lot during context assembly (lots of potential IO activity reading files, RAG data and other documents), storage’s impact on PreFill is less widely known. That is, until you understand how prompt processing can be held up for KV store recalculation (going back to context tokens and rebuilding some or all of the current KV store for the prompt).
Increasing context leads to more tokens, which leads to larger KV stores, all of which impacts TTFT
Although it’s only conjecture on my part, the biggest portion of the Tprefill above seems to be calculating and converting context/memory tokens into KV elements stored in the prompt’s KV store. KV stores are used during downstream prompt processing because they can be easily accessed and each KV item represents interpreted token information in a form the LLM can easily use.
And what’s not evident in the above TTFT decomposition chart is that tool use generates even more tokens, as tool results, all of which need to be processed into more KV store elements in order to determine what to do next.
What happens to large KV stores during prompt processing
If there is a single GPU running a single prompt it’s possible, depending on model and HBM size, that it will run out of GPU HBM and offload or move some portion of its KV (store) cache to CPU memory. But if that GPU is processing 100s to 1000s of prompts concurrently, even CPU memory may not be large enough to hold every KV cache segment that no longer fits in GPU HBM. And of course most enterprise AI servers hold anywhere from 4 to 10 GPUs, each running 100s to 1000s of prompts concurrently.
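A bit of back-of-the-envelope arithmetic shows how quickly this adds up. The sketch below uses the usual sizing rule for a KV cache (one key and one value vector per layer, per KV head, per token). The model shape (32 layers, 8 KV heads, head dimension 128, 2 byte elements) is purely my assumption for illustration, not a figure from Solidigm's talk, and the function names are hypothetical. It also assumes no sharing of KV segments between prompts and no KV quantization.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # One key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_gib(context_tokens, concurrent_prompts, **model_shape):
    # Total KV footprint if every prompt holds its own full-context cache.
    total = kv_bytes_per_token(**model_shape) * context_tokens * concurrent_prompts
    return total / 2**30

# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for prompts in (1, 100, 1000):
    size = kv_cache_gib(42_000, prompts, **shape)
    print(f"{prompts:>5} concurrent 42K-token prompts: {size:,.1f} GiB of KV cache")
```

Under these assumed numbers each token costs 128 KiB of KV cache, so a single 42K token context comes to a little over 5 GiB, and a few hundred of them running concurrently would outgrow the HBM on any one GPU.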
KV cache offload is where fast storage can significantly speed up prompt processing
There’s an obvious tradeoff here with respect to KV stores. One can always go back to the PreFill phase, reread all the tokens in the current context and recompute the KV store, or one can offload KV store segments to memory, local storage or network storage and later retrieve the already computed KV store from wherever it ended up.
The tradeoff is how long it takes to recompute versus how long the data transfers take to offload and retrieve the KV cache segments. Larger contexts increase KV store size, which leads to more need to offload or jettison KV store segments when running out of GPU HBM space. Both approaches, caching KV segments to memory or storage and jettisoning segments then reconstituting them, add time to TTFT. The question is which is faster.
One can see how this becomes ever more of an issue as prompt token counts (and KV elements) skyrocket, and as more prompts run concurrently on the same GPU(s) in a single server.
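One way to reason about "which is faster" is a simple cost model. The sketch below is my own simplification: it treats recompute time as tokens divided by a PreFill rate, and reload time as bytes divided by read bandwidth plus a fixed latency. The function names, the 128 KiB per token KV size, the 10,000 tokens/sec PreFill rate and the bandwidth figures are all assumptions picked for illustration, not measurements from Solidigm or anyone else.

```python
def recompute_seconds(tokens, prefill_tokens_per_sec):
    # Simplified linear model of rerunning PreFill over the context.
    return tokens / prefill_tokens_per_sec

def reload_seconds(tokens, kv_bytes_per_token, read_bytes_per_sec, fixed_latency_s=0.0):
    # Time to read previously offloaded KV segments back in.
    return fixed_latency_s + tokens * kv_bytes_per_token / read_bytes_per_sec

def cheaper_path(tokens, kv_bytes_per_token, prefill_tokens_per_sec,
                 read_bytes_per_sec, fixed_latency_s=0.0):
    t_recompute = recompute_seconds(tokens, prefill_tokens_per_sec)
    t_reload = reload_seconds(tokens, kv_bytes_per_token,
                              read_bytes_per_sec, fixed_latency_s)
    choice = "reload" if t_reload < t_recompute else "recompute"
    return choice, t_recompute, t_reload

# Assumed inputs: 42K tokens, 128 KiB of KV per token, 10K tokens/sec PreFill.
for label, bandwidth in (("slow tier, 0.1 GB/s", 0.1e9), ("fast tier, 10 GB/s", 10e9)):
    choice, t_rc, t_rl = cheaper_path(42_000, 131_072, 10_000, bandwidth)
    print(f"{label}: recompute {t_rc:.2f}s vs reload {t_rl:.2f}s -> {choice}")
```

With no fixed latency, the context length cancels out of this model: reloading wins whenever read bandwidth exceeds KV bytes per token times the PreFill rate. Real PreFill cost grows faster than linearly with context length because of attention, so if anything the linear model understates the cost of recomputing long contexts.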
Obviously, large local SSDs with very fast random reads would be ideal for KV cache offload activity, where a KV cache segment is written out once (and extended as prompt processing adds context) but read back multiple times. That is a great application for large capacity, fast read NVMe SSDs which, I must say, are Solidigm’s forte.
NVIDIA and others have started to add KV cache offloading to their inferencing stacks. As they do, the performance of large, fast NVMe SSDs during AI prompt processing will become one of the critical factors in TTFT.
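To illustrate that write once, read many pattern, here is a small Python sketch of a two tier KV cache that keeps the most recently used segments in memory and spills the rest to local files. It is only a sketch under my own assumptions: `TieredKVCache` and its methods are hypothetical names, plain Python lists pickled to a temp directory stand in for real KV tensors on an NVMe drive, and no real inference stack works exactly this way.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Hot KV segments stay in memory; colder ones are spilled to local files."""

    def __init__(self, max_in_memory, spill_dir=None):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kv_spill_")
        self.hot = OrderedDict()   # segment_id -> segment, oldest first (LRU)
        self.dirty = set()         # segments changed since their last disk write
        self.disk_writes = 0
        self.disk_reads = 0

    def _path(self, segment_id):
        return os.path.join(self.spill_dir, f"{segment_id}.kv")

    def _evict(self):
        while len(self.hot) > self.max_in_memory:
            victim_id, victim = self.hot.popitem(last=False)
            if victim_id in self.dirty:
                # Only new or extended segments are written out.
                with open(self._path(victim_id), "wb") as f:
                    pickle.dump(victim, f)
                self.dirty.discard(victim_id)
                self.disk_writes += 1

    def put(self, segment_id, segment):
        # Store a new segment, or an extended one as context grows.
        self.hot[segment_id] = segment
        self.hot.move_to_end(segment_id)
        self.dirty.add(segment_id)
        self._evict()

    def get(self, segment_id):
        if segment_id in self.hot:
            self.hot.move_to_end(segment_id)
            return self.hot[segment_id]
        if not os.path.exists(self._path(segment_id)):
            return None            # caller would have to recompute via PreFill
        with open(self._path(segment_id), "rb") as f:
            segment = pickle.load(f)
        self.disk_reads += 1
        self.hot[segment_id] = segment   # clean copy, no rewrite on next eviction
        self._evict()
        return segment

cache = TieredKVCache(max_in_memory=2)
for name, segment in (("a", [1]), ("b", [2]), ("c", [3])):
    cache.put(name, segment)
for name in ("a", "b", "c", "a", "b", "c"):
    cache.get(name)
print("disk writes:", cache.disk_writes, "disk reads:", cache.disk_reads)
```

Each segment gets written to the spill directory at most once unless it is extended, while reads keep recurring as prompts cycle in and out of the hot tier, which is why read performance of the spill device matters more than write performance here.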
In the meantime, if anyone has any large, fast NVMe SSDs they don’t need anymore, please let me know. 🙂