[Prompt]
Custom topic: Let's do an episode about LLM evals, focusing on what can actually be evaluated. We'll look at two specific areas: 1) quality — and how quality can be assessed (e.g., coherence, factuality, instruction-following, hallucination rates) and 2) technical parameters like inference speed and context window size, and how those can be benchmarked.

Let's also talk about the use cases for evals — who runs them, and why. And beyond just testing and verifying, an interesting wrinkle: can eval results differ depending on what GPU you're running on? Does hardware matter for reproducibility?

[Response]
Corn: If you’re deploying a Large Language Model in production today, how do you actually know it’s working? I don’t just mean "does it generate text," because we know they all do that. I mean, is it generating the right text, at the right speed, on the right hardware, without burning a hole in your budget or hallucinating a legal defense that doesn't exist?

Herman: It’s the million-dollar question, Corn. Or, depending on your compute bill, the ten-million-dollar question. Today’s prompt from Daniel is about the practical landscape of LLM evaluations. We’re moving past the era of "vibes-based" testing where a developer pokes a model five times and says, "Yeah, looks good to me," and moving into a world where evaluation is a compliance and cost necessity.

Corn: It’s about time. The "vibes" era was fun, but you can’t run a bank on vibes. By the way, fun fact for the listeners—Google Gemini 1.5 Flash is actually the one writing our script today, which is a meta-layer of evaluation right there. I’m Corn, the one who asks the annoying questions.

Herman: And I’m Herman Poppleberry, the one who stays up until three in the morning reading white papers on KV caching so you don't have to. 

Corn: He really does. It’s a problem. But let’s get into Daniel’s prompt. He wants us to look at two specific pillars: Quality—things like coherence and factuality—and Technical parameters like inference speed and context window size. And then there’s this fascinating wrinkle about whether the hardware itself—the actual GPU—changes the results of the evaluation.

Herman: That hardware piece is the "dark matter" of AI right now. People assume a model is a static mathematical object, but in reality, it’s a piece of software running on physical silicon, and that physical reality leaks into the results. But before we get to the silicon, we should probably define what "quality" even means in 2026. 

Corn: Right, because "quality" is subjective, isn't it? If I ask a model to write a poem, quality is one thing. If a doctor asks it to summarize a patient's history, quality is a life-or-death metric. How are we actually measuring things like coherence and instruction-following without just having a human sit there and grade it like a tired high school teacher?

Herman: We’ve moved toward "LLM-as-a-Judge" frameworks. It sounds circular, but you basically use a more powerful model—say, a GPT-5 or a Claude 4—to grade the output of a smaller, faster model. You give the judge a rubric, just like you would a human. For coherence, we also look at automatic signals like perplexity—a measure of how "surprised" a scoring model is by a sequence of words. If the perplexity is low, the text is predictable and tends to flow logically.
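[Herman's perplexity metric fits in a few lines. This is a toy sketch with made-up per-token log-probabilities rather than a real model call—in practice you'd get these from a scoring model's logits:]

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token.
    Lower perplexity means the scoring model found the text more predictable."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probs from a scoring model:
fluent = [-0.5, -0.8, -0.3, -0.6]      # mostly high-probability tokens
garbled = [-4.2, -3.9, -5.1, -4.7]     # consistently "surprising" tokens

print(perplexity(fluent))   # low: the text flows predictably
print(perplexity(garbled))  # high: the model is surprised at every step
```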

Corn: But wait—but what about the "safe" answer problem? If a model just repeats "I am a helpful assistant" over and over, its perplexity would be incredibly low because it's predictable, but the quality is zero. How do you distinguish between "logical flow" and "robotic repetition"?

Herman: That’s a sharp catch. Perplexity is a signal, not the whole story. To combat that "safe" output, we use "Semantic Diversity" scores. We measure how much the model varies its vocabulary and sentence structure. If the perplexity is low but the diversity is also low, the model is likely "collapsed" into a repetitive loop. It’s the difference between a smooth-talking orator and a broken record that only plays one note perfectly.

Corn: Okay, that makes sense. But what about the stuff that actually matters for reliability, like hallucination rates? That’s the big boogeyman.

Herman: Hallucination is where RAGAS—short for Retrieval-Augmented Generation Assessment—comes in. If you’re using Retrieval-Augmented Generation, where the model looks at your private data before answering, RAGAS measures "Faithfulness." It asks: "Is every claim in this answer supported by the retrieved context?" If the model says the company's Q3 revenue was fifty million, but the document it’s looking at says forty million, that’s a fail on faithfulness.
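[The faithfulness idea Herman describes can be sketched with a pluggable claim-checker. The string-matching checker below is a toy stand-in—RAGAS and similar frameworks use a judge-LLM call in its place:]

```python
def faithfulness(claims, context, is_supported):
    """Fraction of extracted claims that the retrieved context supports.
    `is_supported` is pluggable -- in practice, a judge-LLM call."""
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)

# Toy checker: a claim is "supported" if its key figures appear in the context.
def naive_checker(claim, context):
    return all(tok in context for tok in claim.split() if tok.isdigit())

context = "Q3 revenue was 40 million, up from 35 million in Q2."
claims = ["Q3 revenue was 40 million", "Q3 revenue was 50 million"]
print(faithfulness(claims, context, naive_checker))  # 0.5: one claim unsupported
```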

Corn: I like the idea of "SelfCheckGPT" too. I saw a note about this where you ask the model the same question multiple times. If it gives you five different answers, it’s probably making it up. It’s like interviewing a suspect—if their story changes every time you ask, they’re lying.

Herman: That’s a great way to put it. High variance across repeated samples—especially at low temperature, where outputs should be nearly deterministic—is a massive red flag. And then you have instruction-following. We use benchmark suites like MT-Bench for that. It’s not just "can you answer this," but "can you answer this in exactly three bullet points, using only lowercase letters, while maintaining a professional tone?" It tests the model’s ability to stay within the guardrails you’ve set.
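[A rough sketch of the consistency signal Corn and Herman are describing: average pairwise token overlap across repeated samples. The real SelfCheckGPT method uses stronger scoring (NLI models or LLM prompts); simple Jaccard overlap here is just an illustration of the idea:]

```python
from itertools import combinations

def consistency(samples):
    """SelfCheckGPT-style signal: average pairwise token overlap (Jaccard)
    across repeated answers. Low consistency suggests the model is guessing."""
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb)
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

stable = ["Paris is the capital of France."] * 3
shaky = ["It was founded in 1987.",
         "Founded in 1992, I believe.",
         "The year was 2001."]
print(consistency(stable))  # 1.0: identical answer every time
print(consistency(shaky))   # low: the story changes with every sample
```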

Corn: It’s funny because as humans, we’re actually pretty bad at that too. If you told me to speak only in lowercase, I’d probably mess it up within a minute. But for a model, that’s the difference between a helpful tool and a broken integration. Now, let’s flip to the nerdy side—the technical parameters. Because you can have the smartest model in the world, but if it takes forty-five seconds to start typing a response, your users are going to delete your app.

Herman: Exactly—and that’s what technical benchmarking measures. We look at two main things: throughput and latency. Throughput is your "Tokens Per Second"—the raw volume of text the model can churn out. Latency is "Time to First Token," or TTFT. For a chatbot, TTFT is the most important metric for "perceived" speed. If that first word pops up in two hundred milliseconds, the user feels like the AI is thinking with them. If it takes three seconds, it feels like the 1990s dial-up era.
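[TTFT and throughput boil down to a stopwatch around a token stream. In this sketch a sleep-based fake generator stands in for a real streaming API response—swap in your actual stream iterator:]

```python
import time

def measure(stream):
    """Time-to-first-token and tokens/sec over any iterator of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total  # (seconds, tokens per second)

def fake_stream(n=50, first_delay=0.05, per_token=0.002):
    time.sleep(first_delay)       # stand-in for prefill (prompt processing)
    for i in range(n):
        time.sleep(per_token)     # stand-in for per-token decode
        yield f"tok{i}"

ttft, tps = measure(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tok/s")
```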

Corn: How does that work in practice when you have multiple users? Does the TTFT stay the same if ten people are asking questions at once versus ten thousand?

Herman: That is where the "Concurrency Scaling" benchmark comes in. It’s not a flat line; it’s a curve. As you add more concurrent users, the GPU has to manage more "batches." Eventually, the memory bandwidth gets saturated, and your TTFT starts to climb exponentially. It’s like a restaurant kitchen—one chef can make one steak in ten minutes. But if a hundred people order steaks at the same time, the first steak might still take ten minutes, but the hundredth person is going to be waiting two hours because there’s only so much space on the grill.

Corn: And what about context window utilization? Every model now claims to have a "massive" context window—two hundred thousand tokens, a million tokens. But is that like a car that says it can go two hundred miles an hour, but it starts shaking and loses its bumper at eighty?

Herman: It’s exactly that. We use "Needle in a Haystack" tests to benchmark this. You bury a tiny, specific fact—the "needle"—in a massive document—the "haystack"—and ask the model to find it. What we’ve seen in recent research is a "lost in the middle" phenomenon. Models are great at remembering the beginning of the prompt and the very end, but their attention mechanisms often get "muddy" in the middle of a hundred-page document. 
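[A minimal needle-in-a-haystack probe looks like this. The filler text, needle string, and depth sweep are illustrative—real harnesses bury the needle in long natural documents and score the model's retrieval at each depth:]

```python
def build_haystack(needle, depth, n_paragraphs=200):
    """Build a probe prompt: filler paragraphs with one specific fact
    buried at a chosen relative depth (0.0 = start, 1.0 = end)."""
    filler = "The quick brown fox jumps over the lazy dog. " * 10
    paragraphs = [filler] * n_paragraphs
    paragraphs.insert(int(depth * n_paragraphs), needle)
    context = "\n\n".join(paragraphs)
    question = "What is the secret passphrase mentioned in the document?"
    return f"{context}\n\n{question}"

needle = "The secret passphrase is 'blue-giraffe-42'."
# Sweep depths to map where retrieval degrades -- the classic result is a
# dip in the middle ("lost in the middle"), with strong recall at the edges.
prompts = {d: build_haystack(needle, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```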

Corn: Why does that happen? Technically speaking, if the math is consistent, why does the middle get lost?

Herman: It’s a function of how the attention scores are distributed. As the context grows, the model has to decide which tokens are relevant to the current token it’s generating. If you have a hundred thousand tokens, the "signal" of that one needle in the middle gets diluted by the "noise" of the ninety-nine thousand other tokens. It’s also a memory bandwidth issue. This is where KV caching—Key-Value caching—comes in. The model stores the "meaning" of previous tokens in memory so it doesn’t have to re-process the whole document every time it generates a new word. But that cache takes up massive amounts of VRAM. If your hardware can’t handle the cache efficiently, the model starts to struggle, and speed drops off a cliff.

Corn: So if it's a memory issue, does that mean a model with a "million token window" is actually a lie if you're running it on a consumer GPU with only 24GB of VRAM?

Herman: It’s not a lie on paper, but on that hardware it’s physically impossible to use effectively. To actually run a million tokens with a full KV cache for a model like Llama 3 70B, you might need hundreds of gigabytes of VRAM. If you don't have that, the system has to "offload" the cache to slower system RAM or even the SSD. At that point, your "Tokens Per Second" drops from eighty to maybe zero-point-five. You’re essentially watching the model think in slow motion.
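[Herman's "hundreds of gigabytes" figure checks out with back-of-envelope math, assuming Llama 3 70B's published shape—80 layers, 8 KV heads under grouped-query attention, head dimension 128—at FP16:]

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   dtype_bytes=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, each storing
    n_kv_heads * head_dim values per token, at dtype_bytes per value."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

# Llama-3-70B-like shape (grouped-query attention) at FP16:
print(kv_cache_bytes(1) / 1024)            # 320.0 KB per token
print(kv_cache_bytes(1_000_000) / 1e9)     # 327.68 GB for a 1M-token context
```

So a full million-token cache alone, before model weights, is in the ~330 GB range—far beyond any single consumer GPU.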

Corn: So if I’m a developer, I’m looking at these two pillars. I want high quality—no hallucinations, good instruction following—and I want high performance—low TTFT and a reliable context window. Who is actually running these tests? Is it just the big labs like OpenAI and Anthropic trying to out-benchmark each other on Twitter?

Herman: They certainly do that for marketing, but the real growth in 2026 is in enterprise compliance. With the EU AI Act now in full effect, if you’re using an LLM for a "high-risk" application—like hiring, credit scoring, or medical advice—you are legally required to have a documented evaluation process. You can’t just say "it works." You need to show your hallucination rates, your safety guardrail success, and your bias metrics.

Corn: It’s basically the "Nutrition Facts" label for AI. "Warning: This model contains ten percent chance of making up legal precedents." 

Herman: Precisely. And for developers, it’s about regression testing. If you’re building a complex app and you switch from Llama 3 to Llama 3.1, or from GPT-4 to GPT-4o, you need to know that your existing prompts still work. It’s like software engineering—you don't push code without running unit tests. Evaluation is the unit testing of the AI world.

Corn: But wait—how do you do "unit testing" on something that is fundamentally non-deterministic? In traditional code, if I input 2, I expect 4. In LLMs, I input "Hello" and I might get "Hi," "Greetings," or "Howdy." Doesn't that make automated testing a nightmare?

Herman: It does, which is why we don't use "Exact Match" testing anymore. We use "Semantic Similarity." We use an embedding model to turn the AI's response into a vector—a string of numbers—and compare it to the vector of the "correct" answer. If the cosine similarity between the vectors is high enough—say, above 0.9—the test passes, even if the words are different. It’s like grading an essay based on the ideas rather than just checking for a specific keyword.
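[Cosine similarity over embedding vectors is the core of that semantic-match test. The 3-dimensional vectors below are toy stand-ins—real embedding models produce hundreds of dimensions, but the comparison works identically:]

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: near 1.0 means
    the same direction (similar meaning, under the embedding model)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" for illustration:
reference = [0.9, 0.1, 0.0]      # vector for the "correct" answer
paraphrase = [0.85, 0.15, 0.05]  # same idea, different words
off_topic = [0.0, 0.2, 0.95]     # unrelated answer

THRESHOLD = 0.9  # pass if semantically close enough, exact wording aside
print(cosine_similarity(reference, paraphrase) > THRESHOLD)  # True
print(cosine_similarity(reference, off_topic) > THRESHOLD)   # False
```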

Corn: Okay, let’s get into the "wrinkle" Daniel mentioned. This blew my mind when I first heard it. We think of these models as math. Two plus two is four on a calculator, on a phone, on a supercomputer. But you’re telling me that if I run an evaluation for an LLM on an NVIDIA H100, I might get a different result than if I run it on an AMD MI300X? How is that possible?

Herman: It sounds like heresy in the world of computer science, but it’s real. It’s called numerical non-determinism. See, these models rely on floating-point arithmetic—usually FP16 or BF16. When you’re doing billions of calculations to generate a single token, tiny differences in how a specific GPU architecture handles rounding or how it groups mathematical operations can lead to slightly different activations—and therefore slightly different token probabilities—in the final calculation.

Corn: Wait, so a rounding error in the fourth decimal place on token number one can cascade?

Herman: Yes! It’s the butterfly effect in silicon. That tiny difference in the first token’s probability might change the word from "The" to "A." And because every subsequent token is based on all the previous ones, by the time you’re ten words in, the H100 version and the AMD version are writing completely different sentences. Even on the same GPU, if you change the batch size—how many prompts you’re processing at once—the software kernels might optimize the math differently. A study from late 2025 showed that Llama 3.1’s inference speed varied by fifteen percent between an A100 and an H100 purely due to tensor core optimizations. But more importantly, the actual "accuracy" scores on benchmarks shifted by one or two percent.
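[The non-associativity Herman is describing is easy to demonstrate—ordinary Python floats behave the same way GPU accumulators do, just at a different precision:]

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6

# Reordering the same terms can absorb a value entirely:
big, tiny = 1e16, 1.0
print((big + tiny) - big)  # 0.0 -- tiny is lost to rounding
print(big - big + tiny)    # 1.0 -- same terms, different order

# A GPU kernel that sums in a different order (different tiling, different
# batch size) produces bit-different logits; a different token can then win
# the sampling step, and the divergence compounds token by token.
```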

Corn: That’s a nightmare for reproducibility. If I’m a researcher and I publish a paper saying my new model gets an eighty-five percent on a benchmark, and you try to replicate it on a different hardware cluster and get an eighty-two percent, who’s right?

Herman: Both of you are right, which is the problem. It’s leading to what some are calling a "reproducibility crisis" in AI. In 2026, an evaluation score without a hardware spec is just a suggestion. We’re seeing a move toward "Hardware-Aware Evals." If you look at the top of the leaderboards now, you’ll see people specifying the exact GPU, the CUDA version, and even the quantization method used.

Corn: It’s like saying, "This car does zero to sixty in three seconds, but only if you’re at sea level on a Tuesday in July." It makes the benchmark feel a lot less universal. 

Herman: It does, but it’s the reality of the physics. And it’s not just about the "math" being different. The hardware affects the tradeoffs. For example, some GPUs have much better memory bandwidth but slower compute. On those chips, you might find that using a larger model with "quantization"—compressing it down from 16-bit to 4-bit—actually gives you better results than using a smaller "native" model, because the bottleneck is how fast you can pull data from memory, not how fast you can do the math.

Corn: This really changes the game for production. If I develop my bot on an A100 in a dev environment because that’s what was available, but then my company deploys it on an L40S because it’s cheaper for inference, the bot might actually behave differently?

Herman: It might. It might be slightly more prone to certain types of errors, or its "personality" might shift ever so slightly. That’s why the biggest takeaway for anyone doing this seriously is: benchmark on your target hardware. Don’t trust the numbers the model provider gives you in their marketing blog. Those numbers were likely generated on the most optimized, high-end hardware imaginable with specific batch sizes and custom kernels. Your real-world performance will vary.

Corn: It reminds me of those old video games where the physics were tied to the frame rate. If you had a faster computer, the character jumped higher and the game became impossible. We’re basically seeing the AI version of that. 

Herman: That’s a surprisingly good analogy, Corn. I'll allow it. 

Corn: Only one per episode, I know the rules. But let’s talk practicalities. If someone is listening to this and they’re tasked with evaluating a model for their company tomorrow, where do they start? You mentioned some frameworks—DeepEval, RAGAS, MT-Bench. Are these things a lone developer can actually use?

Herman: Most of these are open-source. For quality, I’d start with the LM Evaluation Harness from EleutherAI. It’s the industry standard for a reason—it’s a massive collection of benchmarks you can run locally. If you’re doing RAG, use RAGAS. It’s one of the most direct ways to get a quantitative grip on whether your retrieval is actually helping the model or just confusing it.

Corn: But how do you handle the cost? If I'm using "LLM-as-a-Judge" and I have ten thousand test cases, I'm paying OpenAI or Anthropic to grade my own model. That could cost more than the actual inference!

Herman: That is the "Eval Tax." To avoid it, many developers are now using "Small Language Models" or SLMs as judges. You can fine-tune an 8-billion-parameter model like Llama 3 8B specifically to be a grader. It won't be as smart as GPT-5 at writing poetry, but it can be trained to be a world-class expert at identifying "hallucinations in medical summaries." You run it locally on your own hardware, and your cost drops to near zero.

Corn: That’s clever. It’s like hiring a specialized intern instead of a high-priced consultant to check the homework. And for the technical side? The speed and the memory?

Herman: There are tools like "vLLM" or "TensorRT-LLM" that have benchmarking scripts built-in. They’ll tell you your tokens per second and your memory overhead. But the key is to build a "Golden Dataset." You need a set of five hundred to a thousand prompts that are specific to your business. Don't just rely on "general knowledge" tests. If your bot is supposed to summarize insurance claims, your evaluation should be five hundred insurance claims. Run those through your pipeline, grade them with a judge model—like GPT-4o—and then do that every single time you change a line of code or a piece of hardware.
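[Herman's golden-dataset loop can be sketched as a regression harness with pluggable generate and judge functions. Everything below—the dataset, the generator, the exact-match judge—is a toy stand-in for a real pipeline and judge model:]

```python
def run_eval(golden_set, generate, judge, baseline=None, tolerance=0.02):
    """Regression-eval loop over a domain-specific golden dataset.
    Flags a regression if the mean score drops more than `tolerance`
    below a previously recorded baseline."""
    scores = [judge(item["prompt"], generate(item["prompt"]), item["expected"])
              for item in golden_set]
    mean = sum(scores) / len(scores)
    regressed = baseline is not None and mean < baseline - tolerance
    return {"mean": mean, "regressed": regressed}

# Toy stand-ins for the real pipeline:
golden = [{"prompt": "2+2?", "expected": "4"},
          {"prompt": "capital of France?", "expected": "Paris"}]
fake_generate = lambda p: {"2+2?": "4", "capital of France?": "Paris"}[p]
exact_judge = lambda prompt, output, expected: 1.0 if output == expected else 0.0

report = run_eval(golden, fake_generate, exact_judge, baseline=1.0)
print(report)  # {'mean': 1.0, 'regressed': False}
```

Re-run this after every code change, model swap, or hardware change, and the `regressed` flag becomes your canary.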

Corn: It sounds like a lot of work, but I guess that’s the difference between a toy and a tool. You wouldn’t trust a bridge that wasn't stress-tested, so why trust an AI that’s going to be talking to your customers?

Herman: And the stakes are getting higher. We’re seeing "Model Drift" now, where even if you don’t change anything, the API provider might update the model behind the scenes—what they call a "stealth update"—and suddenly your prompts stop working as well. If you don’t have a continuous evaluation pipeline, you won’t even know your bot is failing until the customer complaints start rolling in.

Corn: It’s the "silent failure" problem. In traditional software, if something breaks, you get an error code. In AI, it doesn't give you an error code; it just starts giving slightly worse advice with the same level of confidence. 

Herman: That’s the most dangerous part of LLMs—the "confident wrongness." Evaluation is the only shield we have against it. And as we move into 2026, we’re seeing this expand into "Agentic Evals." It’s one thing to evaluate a model’s answer. It’s another thing to evaluate an AI agent that is allowed to click buttons, browse the web, and send emails. How do you evaluate a "process" rather than just a "response"?

Corn: That sounds like a whole different level of complexity. You’re not just grading the final essay; you’re grading the student’s entire research process.

Herman: It is. You have to look at "Success Rate," "Step Efficiency"—did the agent take ten steps when it could have taken two?—and "Tool Call Accuracy." If the agent is supposed to look up a flight but it accidentally deletes a calendar invite, that’s a failure even if the final answer is technically "correct."

Corn: This makes me think about the future of hardware. If hardware dependency is this big a deal, are we going to see models that are "compiled" for specific chips? Like, "This version of Llama is optimized for H100s," and you can't even run it on anything else if you want the benchmarked performance?

Herman: We’re already seeing it. NVIDIA’s TensorRT-LLM basically "compiles" a model for a specific GPU architecture to squeeze out every bit of performance. It makes the model faster, but it also locks you into that hardware ecosystem. It’s the classic "hardware-software co-design" problem. We actually touched on this way back in an earlier discussion about hardware-software co-design—it was episode fourteen-seventy if anyone wants to look at the broader trend. But in the context of evals, it means the "universal model" is becoming a myth. Everything is contextual.

Corn: So, we're moving toward a world where the model and the chip are a single unit. It's almost like a return to the console wars, but for enterprise AI.

Herman: Very much so. If you’re an Azure shop, your models are optimized for their custom Maia chips. If you’re on AWS, you’re using Trainium and Inferentia. The benchmark results you see on a public leaderboard might be completely irrelevant to your specific cloud stack. This is why "Internal Benchmarking" is becoming a core job description for AI Engineers. You can't outsource your truth.

Corn: So, the takeaways here for the folks at home. One: vibes are dead. If you’re not using a framework like RAGAS or MT-Bench, you’re just guessing. Two: hardware is not a neutral bystander. The silicon you choose affects the math and the speed, so benchmark on what you’re actually going to use. And three: evaluation is a loop, not a one-time checkbox.

Herman: I’d add a fourth: don’t ignore the humans entirely. Even with "LLM-as-a-Judge," you still need a human to "judge the judge" occasionally. You need to make sure your evaluation rubric actually aligns with what your users want. If the judge model says an answer is "high quality" but your users hate it, your eval is broken.

Corn: "Judge the judge." It’s turtles all the way down, Herman. 

Herman: It really is. But at least now we have better telescopes to see the turtles. 

Corn: That is a terrible metaphor to end on, but I’ll let it slide because we’re out of time. This was a deep one. I think a lot of people overlook the "science" part of data science when it comes to LLMs, but as Daniel’s prompt showed, the technical nuances—especially that hardware non-determinism—are where the real challenges are going to be for the next few years.

Herman: It’s the frontier. We’re moving from the "can we do it" phase to the "can we do it reliably and at scale" phase. And that’s where the real engineering begins.

Corn: Well, if you’re an engineer out there trying to figure out why your model is behaving differently on an A100 versus an H100, at least you know you’re not crazy. It’s just physics.

Herman: Physics and rounding errors. The two things that keep me up at night.

Corn: Among many others. Thanks for the deep dive, Herman. And thanks to Daniel for the prompt—this was one of those topics that feels academic until you’re the one responsible for a production rollout, and then it becomes everything.

Herman: It’s the difference between a prototype and a product. 

Corn: Well said. We’re going to wrap it there. Big thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thank you to Modal for providing the GPU credits that power this show—ironically, we’re probably running on a variety of hardware ourselves, so who knows how our "coherence" is holding up.

Herman: If we start talking about the 1920s jazz scene for no reason, you’ll know we had a hardware drift.

Corn: I’d actually enjoy that. This has been My Weird Prompts. If you’re enjoying the show, a quick review on your podcast app really helps us reach more people who want to go deep on these topics.

Herman: Find us at myweirdprompts dot com for the full archive and ways to subscribe.

Corn: See you next time.

Herman: Take care.