Saeid A. Taghanaki

Principal Scientist & Manager, Microsoft
Adjunct Research Professor, Simon Fraser University
asgt.saeid at gmail dot com

About

I am a Principal Scientist & Manager at Microsoft, where I lead a team building agentic evals for Copilot Tuning.

My research treats evaluation as a central constraint on foundation-model reliability. Static benchmarks, with their closed-form answers, describe a regime distant from how models are deployed: real-world systems operate over open-ended tasks, multi-step trajectories, and continuous optimization pressure, often without a fixed answer key. I focus on two connected threads, expanded below: agentic evaluation and capability assessment for long-horizon self-improving systems, and detection of memorization in generative models.

Before Microsoft I worked on reliable AI at Autodesk Research, where I contributed to Generative AI Content Authentication at the U.S. AI Safety Institute Consortium (NIST AISIC). I completed my PhD at SFU under Prof. Ghassan Hamarneh, with research visits at MILA (Montréal) and Siemens Healthineers (Princeton).

Research

My work sits at the intersection of evaluation, self-improving and long-horizon agents, and foundation-model reliability. The threads below share a common framing: as models grow more capable, evaluation becomes a more central constraint on real-world reliability.

Agentic evaluation and capability assessment

I lead a team at Microsoft building the agentic evaluation infrastructure for Copilot Tuning, a closed-loop system that connects customer-facing evaluation to post-training: reinforcement-learning fine-tuning and inference-time optimization. The work focuses on long-horizon, tool-using agents and on evaluation signals suitable for self-improving systems.

Two questions sit at the core of this work. The first is what to measure. As models scale, surface accuracy can decouple from underlying capability: models sometimes pass tests by exploiting structure in the test rather than by reasoning. The second is what happens to evaluation once a system is being trained to do well on it. Rubrics, judges, and synthetic evaluators can drift, saturate, or get gamed under sustained optimization pressure.

Several of our recent papers address pieces of these two questions. MMLU-Pro+ (NeurIPS 2024) studies shortcut selection in benchmark design, where models can score well on MMLU without the underlying reasoning the benchmark is intended to test. In Explain-Query-Test (EQT) (ICLR 2025) we consider whether models can evaluate themselves without human labels, using a structural asymmetry: a model that can explain a concept well should also be able to comprehend its own explanation, and the gap between the two can serve as a label-free capability signal. In GoalCover (arXiv 2026) we look at the question upstream of training, decomposing a high-level goal into subgoals to help identify capability gaps in fine-tuning data before training. In SibylSense (arXiv 2026) we consider the rubric side, adapting a rubric generator through a tunable memory bank and an adversarial policy loop, with the goal of slowing rubric saturation under optimization.

Memorization and content provenance

Generative models can memorize their training data and reproduce parts of it, sometimes verbatim. This raises copyright, privacy, and reliability concerns that often overlap.

In Detecting Generative Parroting (CVPR 2024) we proposed a method using overfit masked autoencoders to flag probable verbatim reproduction at inference time, without needing access to the training set. A longer discussion (why memorization happens, how to detect it, and what to do about it) appears in a three-part essay series I wrote for Autodesk Research (part 1, part 2, part 3). Related to this work, I contributed to Generative AI Content Authentication at the U.S. AI Safety Institute Consortium (NIST AISIC).

Vision-language and multimodal understanding

In Determining the Preferred Image Distribution of a Black-Box VLM (NeurIPS 2024) we proposed a probing method that estimates a VLM's input preferences without access to weights or training data, with potential use in domains where input distribution matters (CAD, medical, industrial imagery). On the generation side, in SLiMe (ICLR 2024) and SMITE (ICLR 2025) we optimize textual embeddings for single-shot segmentation in images and video respectively. Earlier work on bridging vision and explanation appears in Learned Visual Features to Textual Explanations (ICLR 2024).

Robustness and earlier foundations

These earlier threads inform how I think about evaluation today. In MaskTune (NeurIPS 2022) we proposed a method that addresses spurious correlations by selectively masking salient features, encouraging models to rely on alternative discriminative signals. In Robust Representation Learning via Perceptual Similarity Metrics (ICML 2021, Spotlight) we proposed perceptual similarity as a robustness objective for self-supervised representations. In A Kernelized Manifold Mapping (CVPR 2019) we studied adversarial perturbations through learned manifold projection.

My PhD focused on reliable deep learning for medical imaging. Work from that period includes a Deep Semantic Segmentation review (AI Review 2020), which has been widely cited in medical-image segmentation. Earlier, InfoMask (MICCAI 2019, Early Accept), developed during a research visit at MILA, used masked variational latent representations for weakly supervised chest disease localization.

Service

Selected Talks