Fangyi Yu

About

I am Fangyi Yu, an Applied Scientist at Thomson Reuters, with a strong foundation in Computer Science from Ontario Tech University and Applied Mathematics from Donghua University.

My expertise lies in machine learning, privacy, and natural language processing. At Thomson Reuters, I apply generative AI to legal research products. Previously, I worked on machine learning for sales targeting at Coursera and on machine learning for password security at the Human Machine Lab at Ontario Tech University.

Basic Information

Research:

6+ years

Python:

8+ years

Email:

yufangyi1@gmail.com

Location:

Ontario, Canada

Language:

English, Chinese. Learning French and Spanish

Hobby:

Travelling, Table Tennis, Gym & Workout

Skills

Python

PyTorch

HuggingFace

Tableau

TensorFlow

spaCy

Amazon Bedrock

NLTK

Linux

Git

Amazon SageMaker

Django

Work Experience

Nov 2023 - Present

Thomson Reuters

Applied Scientist

Build in-house LLMs.
Apply generative AI to legal, tax, and corporate research software.

Jun 2023 - Sep 2023

Coursera

Machine Learning Engineer Intern

Developed an enterprise propensity-to-purchase model, combining binary classification for propensity scores with regression for ACV prediction, to improve lead prioritization and sales targeting.
Performed extensive feature engineering and EDA on diverse data sources, including firmographic, demographic, and engagement data.
Collaborated with cross-functional teams to resolve ambiguity, communicated progress and results effectively, and embraced innovative approaches to improve model performance and business impact.

Sep 2021 - Dec 2023

Human Machine Lab @ Ontario Tech University

Graduate Research Assistant and Teaching Assistant

The Human Machine Lab at Ontario Tech University is an interdisciplinary research laboratory with an overarching yet practical ambition: designing computer systems around human needs and capabilities while maintaining human-level intelligence. Its projects span various fields, including human-computer interaction, usable security, privacy, and artificial intelligence.

During my time at the lab, I worked with Dr. Miguel Vargas Martin on various applied machine learning projects, where I:

Published multiple papers on applying machine learning techniques to authentication systems (see my Google Scholar).
Created GAN-based password guessing models that surpassed the benchmark’s performance by 83%, and a GPT-3-based honeyword generation technique to accelerate password breach detection.
Conducted literature reviews on implementations of machine learning algorithms in computer security by searching academic databases.
Generated novel ideas for enhancing the usability and security of password authentication systems via the use of natural language processing.

May 2022 - Dec 2022

Thomson Reuters Labs

Applied Research Scientist Intern

Thomson Reuters (TR) is a content-driven technology company with over a century of experience curating and classifying data and supporting professionals in the legal, tax, and corporate domains. Embedded in this legacy is a team of AI-focused engineers, researchers, data scientists, and designers who work with some of the most comprehensive and richly enhanced legal, tax, and other professional datasets in the world.

During my 8-month internship at TR, I worked on a legal text entailment research project and a product-oriented named entity recognition project. My responsibilities and achievements include:

Published two papers exploring multiple prompt engineering techniques for large language models in tackling legal reasoning tasks.
Collaborated with other research scientists to brainstorm applications of novel techniques in NLP on legal products.
Experimented with various state-of-the-art approaches (zero-shot, few-shot, chain-of-thought prompting, fine-tuning, prompt engineering, etc.) on different large language models (GPT-3 and T5) using Hugging Face and OpenAI APIs for domain-specific reasoning.
Experimented with baseline and cutting-edge statistical and machine learning models for a highly imbalanced named entity recognition project, including spaCy, Conditional Random Fields, and LegalBert.
Documented project updates daily according to lab standards, including literature review, data exploration and processing, experimentation, and testing results.

Oct 2019 - Jul 2020

AI Hub @ Durham College

Research Assistant

The AI Hub at Durham College offers industry partners access to AI solutions to uncover business insights and increase companies' productivity and growth.

During my time at the AI Hub, I:

Built an interactive dashboard with historical gold data using Tableau.
Implemented machine learning models for time series prediction and documented the project, including specifications of datasets, algorithms, APIs, data flow diagrams, comparisons of different scenarios, and future optimization approaches.
Presented data insights to business stakeholders.

Education

Sep 2021 - Jan 2023

Master's Degree

Master of Science in Computer Science

Ontario Tech University (UOIT)

GPA: 4.24 on a scale of 4.30

Location: Ontario, Canada

Sep 2019 - Jun 2020

Post-graduate Certificate

Graduate Certificate in Artificial Intelligence

Durham College

GPA: 4.83 on a scale of 5.00

Location: Ontario, Canada

Bachelor's Degree

Bachelor of Science in Applied Mathematics

Donghua University

Location: Shanghai, China

Selected Publications

Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks

Fangyi Yu, Lee Quartey, and Frank Schilder. Findings of the Association for Computational Linguistics: ACL 2023.

paper

Honey, I Chunked the Passwords: Generating Semantic Honeywords Resistant to Targeted Attacks Using Pre-Trained Language Models

Fangyi Yu, Miguel Vargas Martin. Conference on Detection of Intrusions and Malware & Vulnerability Assessment: DIMVA 2023.

paper

Legal Prompting: Teaching a Language Model to Think Like a Lawyer

Fangyi Yu, Lee Quartey, and Frank Schilder. Proceedings of the Natural Legal Language Processing Workshop: NLLP 2022.

paper

HoneyGAN: Creating Indistinguishable Honeywords with Improved Generative Adversarial Networks

Fangyi Yu, Miguel Vargas Martin. European Symposium on Research in Computer Security (ESORICS) Workshop: STM 2022.

paper

GNPassGAN: Improved Generative Adversarial Networks For Trawling Offline Password Guessing

Fangyi Yu, Miguel Vargas Martin. IEEE European Symposium on Security and Privacy Workshops: EuroS&PW 2021.

paper

On Deep Learning in Password Guessing, a Survey

Fangyi Yu. arXiv preprint, 2022.

paper

Selected Articles (Full list)

Research Methods Involving Human Studies

Towards Data Science, Apr 2022.

This article discusses the lifecycle of a research experiment involving human subjects, including identifying the research hypothesis, defining the study design, conducting a pilot study to evaluate the research's design, system, and instruments, recruiting participants, conducting the actual data collection sessions, analyzing the data, and reporting the results.

read more

How to Build a Fake News Detection Web App Using Flask

Towards Data Science, Aug 2021.

The spread of fake news is unstoppable with the adoption of different social networks. On Twitter, Facebook, and Reddit, people use fake news to spread rumours, gain political advantage, and attract clicks. Detecting fake news is critical for a healthy society. From a machine learning standpoint, fake news detection is a binary classification problem; hence we can use traditional classification methods or state-of-the-art neural networks to deal with it. This tutorial creates a natural language processing application from scratch and deploys it with Flask.

read more

A Thorough Guide to Time Series Analysis

Towards Data Science, Jul 2021.

This article provides a comprehensive guide to time series analysis. It covers what time series data is, what its components are, how it is used, and the purpose of time series analysis, along with the most commonly used forecasting approaches (statistical and machine learning) and an end-to-end example of predicting climatic data using a machine learning model.

read more
