Saeid A. Taghanaki
I am a Principal Scientist & Manager at Microsoft, where I lead a team building agentic evaluation infrastructure for Copilot Tuning.
My research treats evaluation as a central constraint on foundation-model reliability. Static benchmarks, with their closed-form answers, describe a regime distant from how models are deployed: real-world systems operate over open-ended tasks, multi-step trajectories, and continuous optimization pressure, often without a fixed answer key. I focus on three connected threads, expanded below: capability assessment that distinguishes reasoning from shortcut learning, evaluation of long-horizon agents under their own optimization pressure, and detection of memorization in generative models.
Before Microsoft I worked on reliable AI at Autodesk Research, where I contributed to Generative AI Content Authentication at the U.S. AI Safety Institute Consortium (NIST AISIC). I completed my PhD at SFU under Prof. Ghassan Hamarneh, with research visits at MILA (Montréal) and Siemens Healthineers (Princeton).
Research
My work sits at the intersection of evaluation, self-improving and long-horizon agents, and foundation-model reliability. The threads below trace a single conviction: as models grow more capable, evaluation becomes the binding constraint on real-world reliability. They span capability assessment, agent infrastructure, memorization, and the failure modes that emerge under optimization pressure.
Capability and self-evaluation
Standard accuracy benchmarks tell us less than they used to. As models scale, surface accuracy increasingly decouples from real capability. Models pass tests by exploiting structure in the test rather than by reasoning. Two recent threads of mine push back from different angles.
In MMLU-Pro+ NeurIPS 2024 we built a benchmark hardened against the kind of shortcut selection that inflates leaderboard numbers without genuine reasoning behind them. Models that look strong on MMLU often pick the right answer for the wrong reason; MMLU-Pro+ measures and penalizes that behavior explicitly.
In Explain-Query-Test (EQT) ICLR 2025 we asked whether models can evaluate themselves without human labels. The method exploits a structural asymmetry: a model that can explain a concept well should also be able to comprehend its own explanation. Where the gap opens, the model's understanding is shallower than its surface fluency suggests. EQT turns this into a label-free capability signal, useful both as evaluation and as training signal for self-improving systems.
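A minimal sketch of the EQT loop in Python, with `llm` and `grade` as stand-in callables; the paper's actual prompting and grading protocol differs in detail:

```python
from typing import Callable

def explain_query_test(
    llm: Callable[[str], str],           # stand-in for any text-generation call
    grade: Callable[[str, str], float],  # scores an answer against a question's reference
    concept: str,
    n_questions: int = 5,
) -> float:
    """Label-free capability estimate: can the model answer questions
    derived from its own explanation of a concept?"""
    # Explain: the model produces its best explanation of the concept.
    explanation = llm(f"Explain the concept of {concept} in detail.")

    # Query: the model writes exam questions grounded only in that explanation.
    questions = [
        llm("Write one exam question, with a short reference answer, "
            f"based only on this explanation:\n{explanation}")
        for _ in range(n_questions)
    ]

    # Test: the model answers each question cold; a grader (possibly the same
    # model) scores the answer. A large gap between fluent explanation and
    # failed tests signals understanding shallower than the surface suggests.
    scores = [grade(q, llm(f"Answer this question:\n{q}")) for q in questions]
    return sum(scores) / len(scores)
```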
Agentic evaluation at scale
I lead a team at Microsoft building the agentic evaluation infrastructure for Copilot Tuning, a closed-loop system that connects customer-facing evaluation to post-training: reinforcement-learning fine-tuning and inference-time optimization. The work targets long-horizon, tool-using agents and the evaluation signals that make self-improvement possible without constant human supervision.
The questions that drive this work: How do you produce trajectory-, outcome-, and process-level signals for long-horizon agents that don't collapse under their own optimization pressure? How do rubrics, judges, and synthetic evaluators interact when used as training signal in self-improving systems rather than merely as measurement? When does evaluation become the bottleneck, and when does it become the signal that gets gamed?
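To make the signal taxonomy concrete, here is an illustrative sketch of the three granularities; the types and field names are hypothetical, not the Copilot Tuning schema:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str         # tool the agent invoked at this step
    args: dict        # arguments it passed
    observation: str  # what the environment returned

@dataclass
class ProcessSignal:
    step_index: int   # which step a rubric or judge is scoring
    score: float      # per-step quality in [0, 1]
    rationale: str    # the judge's explanation, kept for auditability

@dataclass
class TrajectoryEval:
    steps: list[Step]
    process: list[ProcessSignal]  # process-level: how each step was taken
    outcome_score: float          # outcome-level: does the final state satisfy the task?
    trajectory_score: float       # trajectory-level: holistic judgment of the whole run
```

Any of these fields can be fed back as reward, which is exactly when the questions above stop being academic: a judge used as training signal is a judge under optimization pressure.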
Memorization and content provenance
Generative models memorize their training data and reproduce it, sometimes verbatim. This is simultaneously a copyright issue, a privacy issue, and a reliability issue.
In Detecting Generative Parroting CVPR 2024 we introduced a method using overfit masked autoencoders to flag probable verbatim reproduction at inference time, without needing access to the training set. The full argument (why memorization happens, how to detect it, and what to do about it) appears in a three-part essay series I wrote for Autodesk Research (part 1, part 2, part 3). Related to this work, I contributed to Generative AI Content Authentication at the U.S. AI Safety Institute Consortium (NIST AISIC).
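A minimal sketch of the detection step, under simplifying assumptions (element-wise masking rather than patches, an uncalibrated threshold); not the paper's exact configuration:

```python
import torch

def parroting_score(mae, x: torch.Tensor, mask_ratio: float = 0.75) -> float:
    """Masked-reconstruction error of a candidate sample. An autoencoder
    deliberately overfit on the training corpus reconstructs near-copies of
    that corpus unusually well, so a LOW score flags probable parroting."""
    mask = (torch.rand_like(x) > mask_ratio).float()  # keep ~25% of elements
    with torch.no_grad():
        recon = mae(x * mask)                         # reconstruct from the masked input
    return torch.mean((recon - x) ** 2).item()

def flag_parroting(mae, x: torch.Tensor, threshold: float = 1e-3) -> bool:
    # The threshold is illustrative; in practice it would be calibrated
    # on held-out samples known to be novel.
    return parroting_score(mae, x) < threshold
```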
Vision-language and multimodal understanding
In Determining the Preferred Image Distribution of a Black-Box VLM NeurIPS 2024 we introduced a probing method that surfaces a VLM's input preferences without access to its weights or training data, useful for safe deployment in domains where the input distribution matters (CAD, medical, industrial imagery). On the generation side, SLiMe ICLR 2024 and SMITE ICLR 2025 optimize textual embeddings for robust single-shot segmentation in images and video, respectively. Earlier work on bridging vision and explanation appears in Learned Visual Features to Textual Explanations ICLR 2024.
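A hedged sketch of the probing idea, with hypothetical names and plain accuracy standing in for the paper's protocol: render the same underlying content under several candidate input distributions and compare how reliably the black-box model answers identical questions about each.

```python
# variants: maps a style name (e.g. 'photo', 'cad_render', 'sketch') to images
# depicting the same scenes; questions are paired with the images by index.
def preferred_distribution(vlm, variants: dict, questions: list, grade) -> str:
    scores = {}
    for style, images in variants.items():
        correct = [grade(q, vlm(img, q)) for img, q in zip(images, questions)]
        scores[style] = sum(correct) / len(correct)
    # The style the model answers most reliably is its preferred input
    # distribution; deployment inputs can then be mapped toward it.
    return max(scores, key=scores.get)
```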
Robustness and earlier foundations
These earlier threads inform how I think about evaluation today. MaskTune NeurIPS 2022 mitigates spurious correlations by masking a model's most salient input features and fine-tuning on the masked data, forcing it to discover alternative discriminative signals. Robust Representation Learning via Perceptual Similarity Metrics ICML 2021 (Spotlight) proposed perceptual similarity as a robustness objective for self-supervised representations. A Kernelized Manifold Mapping CVPR 2019 addressed adversarial perturbations through learned manifold projection.
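A per-batch sketch of the MaskTune recipe; the saliency method, masking fraction, and schedule are illustrative assumptions rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def mask_salient(model, x: torch.Tensor, y: torch.Tensor, keep_frac: float = 0.9):
    """Zero out the most salient input pixels (by input-gradient magnitude)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grads, = torch.autograd.grad(loss, x)            # gradients w.r.t. the input only
    saliency = grads.abs().sum(dim=1, keepdim=True)  # (B, 1, H, W)
    k = max(1, int(saliency.flatten(1).shape[1] * (1 - keep_frac)))
    thresh = saliency.flatten(1).topk(k, dim=1).values[:, -1]  # per-sample cutoff
    mask = (saliency <= thresh.view(-1, 1, 1, 1)).float()      # drop the top-k pixels
    return (x * mask).detach()

def masktune_step(model, optimizer, x, y) -> float:
    """One fine-tuning step on saliency-masked inputs: with its shortcut
    features removed, the model must lean on alternative signals."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(mask_salient(model, x, y)), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```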
My PhD focused on reliable deep learning for medical imaging. The most-cited contribution from that period is a Deep Semantic Segmentation review AI Review 2020, now a standard reference in medical-image segmentation. Earlier, InfoMask MICCAI 2019 used masked variational latent representations for weakly-supervised chest disease localization during a research visit at MILA (Early Accept).
Service & Impact
Selected Talks
- 2025 · Towards More Reliable Generative AI: Evaluation and Mitigation Strategies at Microsoft
- 2024 · The Parroting Problem of Generative AI at Autodesk TechX
- 2022 · Spurious Correlations in Computer Vision at Sony AI
- 2022 · Input Space Modifications for Reducing Spurious Correlations at Google Research
- 2021 · Robust Representation Learning via Perceptual Similarity Metrics at ICML
- 2020 · Towards Interpretable and Bias-Resilient Point Cloud Processing at ICML