
Streaming improves UX but makes token usage invisible. Here’s how we fixed it using C# and Semantic Kernel.
Summary
If you’re building production-ready AI applications with Semantic Kernel and using providers like OpenAI, Gemini, or Claude, there’s a hidden problem when you enable streaming: token usage data disappears.
This is a major issue if you’re tracking consumption for billing, enforcing usage limits, or analyzing system performance.
At HyperTrends, we solved this problem in our C# implementation using Semantic Kernel by collecting streaming output manually and calculating token usage ourselves. Here’s exactly how we did it.
The Problem: Streaming Removes Token Metadata
When using InvokeStreamingAsync() in Semantic Kernel or other streaming methods, the response is delivered in parts for a smoother user experience. However, the tradeoff is that token metadata is not returned by most LLM APIs in this mode.
For example:
- OpenAI and Azure OpenAI omit the usage field during streaming.
- Anthropic Claude behaves the same. Token counts are not returned during a streamed response.
- Google Gemini / PaLM also does not include token usage in real time.
As a result, Semantic Kernel cannot populate FunctionResult.Metadata["Usage"], which is normally available for non-streaming calls.
If you are relying on that metadata for tracking, streaming mode breaks your pipeline.
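To make the gap concrete, here is a minimal sketch of the non-streaming case, where the metadata is populated. The prompt string is illustrative, and the concrete type behind the "Usage" entry depends on the connector version, so treat this as a sketch rather than canonical code:
// Non-streaming: the provider's usage data ends up in FunctionResult.Metadata.
var result = await kernel.InvokePromptAsync("Summarize this quarter's sales numbers.");
if (result.Metadata is not null && result.Metadata.TryGetValue("Usage", out var usage))
{
    Console.WriteLine($"Usage reported by provider: {usage}");
}
// With InvokeStreamingAsync(), you only receive content chunks, so this metadata never appears.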
✅ The Solution: Accumulate the Output and Count Tokens Manually
Since the metadata is not provided, the solution is to reassemble the full prompt and completion, and then count the tokens using model-specific tokenizers.
1. Accumulate the Output
During streaming, append the output fragments to a buffer:
// Requires: using System.Text;
var sb = new StringBuilder();
await foreach (var chunk in kernel.InvokeStreamingAsync(...))
{
    sb.Append(chunk); // each chunk's ToString() yields the streamed text fragment
}
string fullOutput = sb.ToString();
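If you call the chat completion service directly instead of a kernel function, the same accumulation pattern applies. A minimal sketch, assuming an IChatCompletionService registered with the kernel and an existing ChatHistory named chatHistory:
// Stream chunks to the user while keeping the full text for token counting afterwards.
// Requires: using Microsoft.SemanticKernel.ChatCompletion;
var chatService = kernel.GetRequiredService<IChatCompletionService>();
var buffer = new StringBuilder();
await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(chatHistory))
{
    Console.Write(chunk.Content);   // render to the user as it arrives
    buffer.Append(chunk.Content);   // accumulate for counting once the stream ends
}
string completionText = buffer.ToString();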
2. Use the Right Tokenizer
Depending on the model, use a compatible tokenizer. For OpenAI models, you can use tiktoken or a C# port such as the Tiktoken NuGet package, SharpToken, or Microsoft.ML.Tokenizers.
// Requires the Tiktoken NuGet package (a C# port of OpenAI's tiktoken).
var encoding = Tiktoken.Encoding.ForModel("gpt-4");
// promptText is the full prompt sent to the model, including system content.
int promptTokens = encoding.CountTokens(promptText);
int completionTokens = encoding.CountTokens(fullOutput);
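If you prefer a Microsoft-maintained package, Microsoft.ML.Tokenizers ships a tiktoken implementation as well. A rough equivalent of the snippet above (API names per that package; double-check them against the version you install):
// Alternative using the Microsoft.ML.Tokenizers package.
// Requires: using Microsoft.ML.Tokenizers;
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
int promptTokens = tokenizer.CountTokens(promptText);
int completionTokens = tokenizer.CountTokens(fullOutput);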
Different models use different tokenization methods. Claude and Gemini will require custom logic or official APIs to estimate or calculate token counts accurately.
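For Gemini, for example, you can call the public countTokens endpoint with the same text you are about to send (or have just received). A rough sketch using HttpClient; the model name, environment variable, and response field are assumptions based on the Generative Language API and should be verified against Google's current docs:
// Rough sketch: ask Gemini's countTokens endpoint for a token count.
// Requires: using System.Net.Http.Json; using System.Text.Json;
using var http = new HttpClient();
string apiKey = Environment.GetEnvironmentVariable("GEMINI_API_KEY")!;
string url = $"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:countTokens?key={apiKey}";
var payload = new
{
    contents = new[] { new { parts = new[] { new { text = promptText } } } }
};
var response = await http.PostAsJsonAsync(url, payload);
using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
int totalTokens = doc.RootElement.GetProperty("totalTokens").GetInt32();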
3. Store and Use the Counts
Once you have the numbers, store them alongside the conversation, use them for billing, or log them for analytics.
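A small record type makes this easy to persist and aggregate per conversation. The type and the store below are our own illustrative names, not part of Semantic Kernel:
// Illustrative shape for persisting usage; the names are ours, not an SK or provider type.
public sealed record TokenUsageRecord(
    string ConversationId,
    string Model,
    int PromptTokens,
    int CompletionTokens,
    DateTimeOffset Timestamp)
{
    public int TotalTokens => PromptTokens + CompletionTokens;
}
// After the stream completes and the counts are calculated:
var record = new TokenUsageRecord(conversationId, "gpt-4", promptTokens, completionTokens, DateTimeOffset.UtcNow);
// await usageStore.SaveAsync(record);   // persist to your database or telemetry sink of choice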
Provider-Specific Considerations
| Provider | Token Usage in Streaming | Manual Counting Needed | Notes |
| --- | --- | --- | --- |
| OpenAI / Azure | No | Yes | Use tiktoken |
| Claude | No | Yes | Has token count API for prompts |
| Gemini / PaLM | No | Yes | Use countTokens endpoint |
Tips for Accuracy
- Always perform token counting after the full stream completes.
- Keep track of which model was used so you apply the correct tokenizer (see the sketch after this list).
- Account for hidden system content, such as injected function-call definitions, when calculating total prompt usage.
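Here is the kind of mapping we mean, reusing the Tiktoken-style API from earlier for OpenAI models; CountWithClaudeApi and CountWithGeminiApi are hypothetical helpers standing in for the provider counting APIs discussed above:
// Sketch: choose a counting strategy based on the model id.
int CountTokensFor(string model, string text) => model switch
{
    var m when m.StartsWith("gpt-") => Tiktoken.Encoding.ForModel(m).CountTokens(text),
    var m when m.StartsWith("claude") => CountWithClaudeApi(text),   // hypothetical wrapper around Anthropic's counting API
    var m when m.StartsWith("gemini") => CountWithGeminiApi(text),   // hypothetical wrapper around the countTokens endpoint
    _ => throw new NotSupportedException($"No tokenizer configured for model '{model}'.")
};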
How Semantic Kernel Handles This
| Call Type | Token Usage Available? | Recommendation |
| --- | --- | --- |
| InvokeAsync() | Yes (in .Metadata) | Use as-is for full metadata |
| InvokeStreamingAsync() | No | Use custom counting logic |
Semantic Kernel does a good job extracting token usage for non-streamed completions, but it cannot extract what the provider doesn’t return. That’s why manual token counting is needed when streaming.
FAQs
Why is token usage missing during streaming?
Because providers optimize for low latency, they skip metadata in streamed payloads. This makes the data fast to deliver but incomplete for logging and billing.
Can Semantic Kernel calculate token usage on its own?
Only in non-streamed mode. For streamed output, you must calculate it manually using a tokenizer.
Is manual counting accurate?
Yes, if you use the official tokenizer for the model. There may be minor differences due to system messages or encoding quirks, but it’s close enough for most billing and tracking scenarios.
Does Claude or Gemini offer token counting tools?
Claude has a token counting API for prompts, and Gemini includes a countTokens endpoint to help with estimates. Neither provides real-time token usage during streaming.
Does this work with function calling and plugins?
Yes, but be careful. Plugins and function calls may inject additional hidden content into the prompt. That affects token count, so always tokenize the full content, not just what the user sees.
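One way to keep this honest is to count over the full ChatHistory the kernel actually sends, not just the user's message. A rough sketch using the same encoding as before; providers add per-message framing tokens that a tokenizer alone cannot see, so treat the result as an estimate:
// Requires: using Microsoft.SemanticKernel.ChatCompletion;
// Counts tokens across every message in the history (system, user, assistant, tool).
int CountHistoryTokens(ChatHistory history, Tiktoken.Encoding encoding)
{
    int total = 0;
    foreach (var message in history)
    {
        // Content can be null for messages that only carry tool calls.
        total += encoding.CountTokens(message.Content ?? string.Empty);
    }
    return total;
}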
Conclusion
Streaming AI responses is essential for delivering great user experiences. However, it comes at a cost: token usage data is not available by default.
At HyperTrends, we solved this problem by building a manual counting system using Semantic Kernel, prompt tracking, and model-specific tokenizers. This gives us full visibility into usage across OpenAI, Claude, and Gemini, even when streaming is enabled. Want to build AI SaaS platforms or agents that help you grow your business? Talk to us!
Until providers begin returning token metadata with streaming output, this is the most reliable method to keep your application accurate, auditable, and scalable.