About

I am Fangyi Yu, an Applied Scientist at Thomson Reuters, with a strong foundation in Computer Science from Ontario Tech University and Applied Mathematics from Donghua University.

My expertise lies in machine learning, privacy, and natural language processing. At Thomson Reuters, I apply generative AI to legal research products. My career includes impactful experiences at Coursera and the Human Machine Lab at Ontario Tech, contributing to significant advancements in AI and security.

Basic Information
Research: 5+ years
Python: 7+ years
Email: fangyi.yu@ontariotechu.net
Location: Ontario, Canada
Language: English, Chinese. Learning French and Spanish
Hobby: Travelling, Table Tennis, Gym & Workout
Skills
Python
PyTorch
Hugging Face
Tableau
TensorFlow
spaCy
Amazon Bedrock
NLTK
Linux
Git
Amazon SageMaker
Django
Work Experience

Nov 2023 - Present

Thomson Reuters
Applied Scientist

  • Build in-house LLMs.
  • Apply generative AI to legal, tax, and corporate research software.

Jun 2023 - Sep 2023

Coursera
Machine Learning Engineer Intern

  • Developed an enterprise propensity-to-purchase model, combining binary classification for propensity scores with regression for ACV prediction, to improve lead prioritization and sales targeting.
  • Performed extensive feature engineering and EDA on diverse data sources, including firmographic, demographic, and engagement data.
  • Collaborated with cross-functional teams to resolve ambiguity, communicated progress and results effectively, and embraced innovative approaches to improve model performance and business impact.

Sep 2021 - Dec 2023

Human Machine Lab @ Ontario Tech University
Graduate Research Assistant and Teaching Assistant

The Human Machine Lab at Ontario Tech University is an interdisciplinary research laboratory with an overarching yet practical ambition: designing computer systems around human needs and capabilities while maintaining human-level intelligence. Its projects span human-computer interaction, usable security, privacy, and artificial intelligence.

During my time at the lab, I had the opportunity to work with Dr. Miguel Vargas Martin on various applied machine learning projects, where I:

  • Published multiple papers on applying machine learning techniques to authentication systems (see my Google Scholar).
  • Created GAN-based password guessing models that surpassed the benchmark's performance by 83%, and a GPT-3-based honeyword generation technique to accelerate password breach detection.
  • Conducted literature reviews on implementations of machine learning algorithms in computer security through researching academic databases.
  • Generated novel ideas for enhancing the usability and security of password authentication systems via the use of natural language processing.

May 2022 - Dec 2022

Thomson Reuters Labs
Applied Research Scientist Intern

Thomson Reuters (TR) is a content-driven technology company with over a century of experience curating and classifying data and supporting professionals in the legal, tax, and corporate domains. Embedded in this legacy is a team of AI-focused engineers, researchers, data scientists, and designers who work with some of the most comprehensive and richly enhanced legal, tax, and other professional datasets in the world.

During my 8-month internship at TR, I worked on a legal text entailment research project and a product-oriented named entity recognition project. My responsibilities and achievements include:

  • Published two papers exploring multiple prompt engineering techniques for large language models in tackling legal reasoning tasks.
  • Collaborated with other research scientists to brainstorm applications of novel techniques in NLP on legal products.
  • Experimented with various state-of-the-art approaches (zero-shot, few-shot, chain-of-thought prompting, fine-tuning, prompt engineering, etc.) on different large language models (GPT-3 and T5) using the Hugging Face and OpenAI APIs for domain-specific reasoning.
  • Experimented with baseline and cutting-edge statistical and machine learning models for a highly imbalanced named entity recognition project, including spaCy, Conditional Random Fields, and LegalBert.
  • Documented project updates daily according to lab standards, including literature review, data exploration and processing, experimentation, and testing results.

Oct 2019 - Jul 2020

AI Hub @ Durham College
Research Assistant

The AI Hub at Durham College offers industry partners access to AI solutions to uncover business insights and increase companies' productivity and growth.

During my time at the AI Hub, I:

  • Built an interactive dashboard of historical gold data using Tableau.
  • Implemented machine learning models for time series prediction and documented the project, including specifications of the datasets, algorithms, APIs, and data flow diagrams, comparisons across different scenarios, and future optimization approaches.
  • Presented data insights to business stakeholders.

Education

Sep 2021 - Jan 2023

Master's Degree
Master of Science in Computer Science

Ontario Tech University (UOIT)

GPA: 4.24 on a scale of 4.30

Location: Ontario, Canada

Sep 2019 - Jun 2020

Post-graduate Certificate

Durham College

GPA: 4.83 on a scale of 5.00

Location: Ontario, Canada

Bachelor's Degree
Bachelor of Science in Applied Mathematics

Donghua University

Location: Shanghai, China

Selected Publications
Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks

Fangyi Yu, Lee Quartey, and Frank Schilder. Findings of the Association for Computational Linguistics: ACL 2023.

Honey, I Chunked the Passwords: Generating Semantic Honeywords Resistant to Targeted Attacks Using Pre-Trained Language Models

Fangyi Yu and Miguel Vargas Martin. Conference on Detection of Intrusions and Malware & Vulnerability Assessment: DIMVA 2023.

Legal Prompting: Teaching a Language Model to Think Like a Lawyer

Fangyi Yu, Lee Quartey, and Frank Schilder. Proceedings of the Natural Legal Language Processing Workshop: NLLP 2022.

HoneyGAN: Creating Indistinguishable Honeywords with Improved Generative Adversarial Networks

Fangyi Yu and Miguel Vargas Martin. European Symposium on Research in Computer Security (ESORICS) Workshop: STM 2022.

GNPassGAN: Improved Generative Adversarial Networks For Trawling Offline Password Guessing

Fangyi Yu and Miguel Vargas Martin. IEEE European Symposium on Security and Privacy Workshops: EuroS&PW 2021.

On Deep Learning in Password Guessing, a Survey

Fangyi Yu. arXiv preprint, 2022.

Selected Articles (Full list)
Research Methods Involving Human Studies

Towards Data Science, Apr 2022.

This article discusses the lifecycle of a research experiment involving human subjects, including identifying the research hypothesis, defining the study design, conducting a pilot study to evaluate the research's design, system, and instruments, recruiting participants, conducting the actual data collection sessions, analyzing the data, and reporting the results.

How to Build a Fake News Detection Web App Using Flask

Towards Data Science, Aug 2021.

With the adoption of social networks, the spread of fake news has become hard to stop. On Twitter, Facebook, and Reddit, people use fake news to spread rumours, gain political advantage, and drive clicks. Detecting fake news is critical for a healthy society. From a machine learning standpoint, fake news detection is a binary classification problem, so it can be approached with traditional classification methods or state-of-the-art neural networks. This tutorial builds a natural language processing application from scratch and deploys it with Flask.

A Thorough Guide to Time Series Analysis

Towards Data Science, Jul 2021.

This article provides a comprehensive guide to time series analysis, covering what time series data is, what its components are, how it is used, and the purpose of time series analysis. It also covers the most commonly used time series forecasting approaches (statistical and machine learning) and an end-to-end example of predicting climatic data using a machine learning model.
