About

Fangyi Yu is an Applied Scientist on the Foundational Research team at Thomson Reuters, where she specializes in large language model (LLM) evaluation. Her work spans autonomous evaluation pipelines, LLM- and agent-as-a-judge methodologies, and the assessment of AI agents in high-stakes domains. She builds evaluation frameworks that help ensure advanced language models are reliable, fair, and aligned with real-world requirements.

Fangyi holds an MSc in Computer Science from Ontario Tech University and a BSc in Applied Mathematics from Donghua University. Her background spans machine learning, natural language processing, and AI safety, with prior contributions to privacy-preserving systems and human–AI interaction research. She has previously worked at Coursera and the Human Machine Lab at Ontario Tech, contributing to projects in AI security and trust.

She focuses on bridging research and practical application to advance the science of AI evaluation, with the goal of supporting trustworthy AI deployment in high-stakes domains.

Skills & Expertise
Research Focus
  • LLM evaluation & benchmarking
  • LLM- and agent-as-a-judge methodologies
  • Post-training data design (SFT, DPO)
  • Natural language processing
  • AI safety & alignment
Languages & Frameworks
  • Python, SQL, Bash
  • PyTorch, TensorFlow
  • Hugging Face Transformers, TRL
  • spaCy, NLTK
  • Django, Flask
Cloud & MLOps
  • Amazon Bedrock, SageMaker
  • OpenAI & Anthropic APIs
  • Docker, Git, Linux
  • Weights & Biases
  • Tableau
AI Developer Tools
  • Claude Code
  • OpenAI Codex
  • Cline
  • GitHub Copilot
  • Cursor

Work Experience

Nov 2023 - Present

Thomson Reuters Foundational Research
Applied Scientist

  • In-house LLM development: contribute to model selection, post-training, and release gates for domain-specific models supporting legal research workflows.
  • Post-training data creation: design instruction-tuning and preference datasets using rubric-driven synthetic generation with human-in-the-loop QA.
  • Auto-evaluation pipelines: build evaluation harnesses that run on every model drop, with task suites and reproducible seeds for stable comparisons.
  • LLM-as-a-judge: implement multi-criteria rubric graders and multi-agent debate evaluation to reduce single-judge bias and improve reliability.
  • Agent evaluation: assess tool-using agents with metrics for task success, tool-call accuracy, latency, and failure recovery in sandboxed environments.
  • Cross-functional collaboration: partner with research, product, and legal subject-matter experts to translate evaluation results into model release criteria and product-ready guidance.

Jun 2023 - Sep 2023

Coursera
Machine Learning Engineer Intern

  • Built an enterprise propensity-to-purchase model combining binary classification for propensity scores with regression for annual contract value (ACV) prediction, improving lead prioritization and sales targeting.
  • Performed extensive feature engineering and exploratory data analysis across firmographic, demographic, and engagement data sources.
  • Partnered with cross-functional stakeholders to scope requirements, communicate results, and iterate on modeling choices to maximize business impact.

Sep 2021 - Dec 2023

Human Machine Lab @ Ontario Tech University
Graduate Research & Teaching Assistant

The Human Machine Lab at Ontario Tech University is an interdisciplinary research group focused on designing computer systems around human needs and capabilities, with projects spanning human–computer interaction, usable security, privacy, and artificial intelligence.

Advised by Dr. Miguel Vargas Martin on applied machine learning research:

  • Published multiple peer-reviewed papers on applying machine learning techniques to authentication systems (see Google Scholar).
  • Designed GAN-based password guessing models that outperformed prior benchmarks by 83%, and developed a GPT-3-based honeyword generation technique to accelerate password-breach detection.
  • Conducted systematic literature reviews on machine learning applications in computer security.
  • Proposed novel approaches to strengthen the usability and security of password authentication systems using natural language processing.

May 2022 - Dec 2022

Thomson Reuters Labs
Applied Research Scientist Intern

Thomson Reuters Labs is the applied-research arm of Thomson Reuters, working with some of the world's most comprehensive legal, tax, and corporate datasets to advance AI for professional services.

Over an 8-month internship, contributed to a legal text entailment research project and a named-entity recognition product initiative:

  • Co-authored two peer-reviewed papers exploring prompt engineering techniques for large language models on legal reasoning tasks.
  • Collaborated with research scientists to identify opportunities for applying state-of-the-art NLP methods to legal products.
  • Evaluated zero-shot, few-shot, and chain-of-thought prompting, as well as fine-tuning strategies, on GPT-3 and T5 for domain-specific reasoning, using the Hugging Face and OpenAI APIs.
  • Benchmarked baseline and state-of-the-art models — including spaCy, Conditional Random Fields, and LegalBERT — on a highly imbalanced named-entity recognition dataset.
  • Maintained rigorous documentation of literature reviews, data processing, and experimental results following lab standards.

Oct 2019 - Jul 2020

AI Hub @ Durham College
Research Assistant

The AI Hub at Durham College partners with industry to deliver AI solutions that uncover business insights and drive productivity and growth.

  • Built interactive dashboards on historical gold-market data using Tableau.
  • Implemented machine learning models for time-series forecasting, with end-to-end documentation covering datasets, algorithms, APIs, data-flow diagrams, and optimization options.
  • Presented data-driven insights and recommendations to business stakeholders.

Education

Sep 2021 - Jan 2023

Master's Degree
Master of Science in Computer Science

Ontario Tech University (UOIT)

GPA: 4.24 on a scale of 4.30

Location: Ontario, Canada

Sep 2019 - Jun 2020

Post-graduate Certificate

Durham College

GPA: 4.83 on a scale of 5.00

Location: Ontario, Canada

Bachelor's Degree
Bachelor of Science in Applied Mathematics

Donghua University

Location: Shanghai, China

Selected Publications
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

Fangyi Yu. arXiv preprint, 2025.

Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks

Fangyi Yu, Lee Quartey, Frank Schilder. Findings of the Association for Computational Linguistics (ACL), 2023.

Honey, I Chunked the Passwords: Generating Semantic Honeywords Resistant to Targeted Attacks Using Pre-Trained Language Models

Fangyi Yu, Miguel Vargas Martin. Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), 2023.

Legal Prompting: Teaching a Language Model to Think Like a Lawyer

Fangyi Yu, Lee Quartey, Frank Schilder. Natural Legal Language Processing Workshop (NLLP), 2022.

HoneyGAN: Creating Indistinguishable Honeywords with Improved Generative Adversarial Networks

Fangyi Yu, Miguel Vargas Martin. European Symposium on Research in Computer Security — STM Workshop (ESORICS), 2022.

GNPassGAN: Improved Generative Adversarial Networks For Trawling Offline Password Guessing

Fangyi Yu, Miguel Vargas Martin. IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), 2021.

On Deep Learning in Password Guessing: A Survey

Fangyi Yu. arXiv preprint, 2022.

Selected Articles
Research Methods Involving Human Studies

Towards Data Science, April 2022.

A walkthrough of the lifecycle of a human-subjects research experiment — from formulating hypotheses and designing the study, to running pilots, recruiting participants, collecting and analyzing data, and reporting results.

How to Build a Fake News Detection Web App Using Flask

Towards Data Science, August 2021.

A hands-on tutorial on framing fake-news detection as a binary classification problem, building an NLP model from scratch, and deploying it as a Flask web application.

A Thorough Guide to Time Series Analysis

Towards Data Science, July 2021.

A comprehensive guide to time-series analysis — covering core concepts and components, common statistical and machine-learning forecasting methods, and an end-to-end worked example predicting climate data.
