Gen-AI: Observability Vs. Evaluation Vs. Interpretability

Jun 10, 2025 | AI/ML, Insights

While working in the field of Generative AI, end-users and developers alike often come across terms like Observability, Evaluation, and Interpretability. This blog post briefly differentiates among these terms.

Observability: In simple terms, observability means storing a language model's inputs and outputs in a database rather than just remembering them. One could argue that writing code for this is a waste of time, since it is easy to remember what was asked and what the LLM replied for one, two, or three query cycles. That argument holds for short or easily quantifiable answers over a few queries, but it breaks down when hundreds of query cycles are involved and the answers are difficult, or outright impossible, to quantify. Multiple tools can handle this for you, so you do not have to write the logging code and its interface yourself [1]. These tools are mature: they cover the basic mapping of inputs to outputs along with multiple other downstream tasks. Observability is a developer-facing framework and does not face the app's end-user.
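To make the idea concrete, here is a minimal, hand-rolled sketch of observability: persist every prompt/response pair to a small SQLite database instead of relying on memory. The `call_llm` function is a hypothetical placeholder, not any real library's API; the tools referenced in [1] add tracing, latencies, token counts, and dashboards on top of this basic idea.

```python
import sqlite3
from datetime import datetime, timezone

# Minimal observability sketch: persist every prompt/response pair so that
# hundreds of query cycles remain inspectable later. Real observability
# tools [1] layer tracing, token counts, and a UI on top of this idea.

conn = sqlite3.connect("llm_traces.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS traces (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           timestamp TEXT,
           prompt TEXT,
           response TEXT
       )"""
)

def log_interaction(prompt: str, response: str) -> None:
    """Store one query cycle in the database."""
    conn.execute(
        "INSERT INTO traces (timestamp, prompt, response) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), prompt, response),
    )
    conn.commit()

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return f"echoed: {prompt}"

if __name__ == "__main__":
    for question in ["What is observability?", "What is evaluation?"]:
        answer = call_llm(question)
        log_interaction(question, answer)
    # After many query cycles, the database, not memory, is the record.
    print(conn.execute("SELECT COUNT(*) FROM traces").fetchone()[0], "traces stored")
```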

Evaluation: On the other side, when an LLM's outputs are assessed, either by humans or by other models, the process is called Evaluation. Evaluation tells us how much a generated answer differs from other generated answer(s). Like observability, there are multiple tools that can help you evaluate Gen-AI applications [2]. Evaluation can serve both developers and the app's end-users.
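As a toy illustration of comparing generated answers, the sketch below scores how much one answer differs from another using simple string similarity from the Python standard library. The example strings are made up for illustration; real evaluation frameworks [2] offer far richer criteria, such as correctness, relevance, and LLM-as-judge grading.

```python
from difflib import SequenceMatcher

# Toy evaluation sketch: quantify how much one generated answer differs
# from another (or from a reference) with a simple similarity ratio.

def similarity(answer_a: str, answer_b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical answers."""
    return SequenceMatcher(None, answer_a, answer_b).ratio()

reference = "Observability means persisting every prompt and response."
candidates = [
    "Observability means storing every prompt and response.",
    "Evaluation compares generated answers against each other.",
]

for candidate in candidates:
    print(f"{similarity(reference, candidate):.2f}  {candidate}")
```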

Interpretability: Interpretability (often called Mechanistic Interpretability and closely tied to Alignment work) is the practice of looking inside the LLM's workings, in contrast to Observability and Evaluation, where the focus is outside the LLM. In practice, interpretability means finding the most active neurons, out of the billions or potentially trillions in the underlying neural network, for a given query. Once they are identified, multiple downstream tasks can be performed. Interpretability is often used to explain a model's output or to steer the model towards a particularly desirable behavior by actively controlling the "firings" of those neurons with various tools. Interpretability can be performed with multiple frameworks, e.g., Sparse Autoencoders (SAEs) and circuit/logit analysis [3]. Though this approach is primarily used by developers, it has significant potential to be used in the front-end to increase end-users' trust in the generated output.
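The toy sketch below shows the basic mechanics of "finding the most active neurons" on a tiny PyTorch MLP, using a forward hook to capture hidden activations and then ranking them. The model, input, and layer names are all made up for illustration; real mechanistic interpretability work, such as SAEs or circuit tracing [3], applies these ideas to actual LLMs with billions of neurons.

```python
import torch
import torch.nn as nn

# Toy interpretability sketch: record hidden activations with a forward hook
# and report the most active neurons for a given input.

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 64),   # "hidden layer" whose neurons we inspect
    nn.ReLU(),
    nn.Linear(64, 4),
)

activations = {}

def capture(module, inputs, output):
    # Store the post-ReLU activations for later analysis.
    activations["hidden"] = output.detach()

model[1].register_forward_hook(capture)

query = torch.randn(1, 16)          # stand-in for an embedded user query
logits = model(query)

# Identify the top-k most active neurons for this particular query.
values, indices = torch.topk(activations["hidden"].squeeze(0), k=5)
for idx, val in zip(indices.tolist(), values.tolist()):
    print(f"neuron {idx:2d} fired with activation {val:.3f}")
```

In a real setting, the same hook-and-rank pattern is the starting point for downstream tasks such as explaining an output or suppressing and amplifying specific neurons to steer behavior.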

 

References:

[1] https://docs.llamaindex.ai/en/stable/module_guides/observability/

[2] https://python.langchain.com/api_reference/langchain/evaluation.html

[3] https://www.anthropic.com/research/tracing-thoughts-language-model