'Improving Discrete Time BTYD Model with Covariates and Non-Parametric Priors'
Senior Thesis: Wharton Research Scholar
Using a simulation-based approach, the paper found that models with non-parametric priors can adequately recover simplified parametric distribution shapes without drastically overfitting, and that they offer some predictive improvement when the underlying distribution is multi-modal. The paper also found that models omitting a covariate, when such an effect is present, may bias parameter estimates systematically upward. Finally, a market simulation showed how the covariate effects extracted from the models can help firms better target their marketing and improve ROI.
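The omitted-covariate bias can be illustrated with a minimal simulation, assuming a log-linear Poisson purchase process; the parameter names and values here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# True data-generating process: purchase counts are Poisson with a
# log-linear covariate effect (names and values are illustrative).
b0, b1 = 0.5, 0.8
x = rng.normal(size=100_000)   # customer covariate, mean zero
lam = np.exp(b0 + b1 * x)      # individual purchase rates
y = rng.poisson(lam)           # observed purchase counts

# A model that omits the covariate estimates one common rate; its MLE
# satisfies exp(b0_hat) = mean(y).
b0_hat = np.log(y.mean())

# By Jensen's inequality E[exp(b1 * x)] > exp(b1 * E[x]) = 1, so the
# intercept absorbs the ignored heterogeneity and is biased upward.
print(round(b0_hat, 2), "vs true", b0)
```

The intercept-only fit lands well above the true base rate, mirroring the systematic upward bias described above.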
'The Startup Ecosystem and a Preference Based Recommendation System'
Master's Independent Study: Data Science Practicum
Drawing on the marketing literature on hierarchical choice models, the paper derived a generalizable ML model with the features necessary to approximate a multi-level choice structure. The final model performed better than a conventional model and a random benchmark, both for seasoned individuals and for new individuals with no application history. In addition, using model diagnostics such as variable-importance plots, the paper found that while the rank order of individual preferences did not change drastically, some aspects of job listings became more important to users than before due to the pandemic.
'Quantifying NBA play style drift using Finite Mixture Multinomial model'
Senior Capstone: Sports Analytics
The modern NBA play style is often associated with more three-point attempts and fewer inefficient mid-range shots. While this aggregate trend is easy to observe, we pose three outstanding questions that it fails to answer. First, is there a growing volume of three-point specialists entering the league? Second, is play style changing across all players? Third, is the trend heterogeneous across top- and bottom-performing teams? Through player clusters created from a finite mixture multinomial model, we found that the overall shift toward three-pointers in the NBA is attributable both to a growing volume of rookies who predominantly shoot threes and to a more general trend of all-around players transitioning into three-point specialists.
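A minimal sketch of the clustering idea, assuming a two-component multinomial mixture over (rim, mid-range, three-point) shot counts fit by EM; all data and parameter values below are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shot-profile data: each row is a player's season counts of
# (rim, mid-range, three-point) attempts, simulated from two latent
# play styles.
styles = np.array([[0.5, 0.4, 0.1],   # traditional scorer
                   [0.3, 0.1, 0.6]])  # three-point specialist
z_true = rng.integers(0, 2, size=300)
X = np.vstack([rng.multinomial(200, styles[z]) for z in z_true])

# EM for a K-component finite mixture of multinomials, initialized
# with a simple heuristic split on observed three-point share.
K, eps = 2, 1e-12
share3 = X[:, 2] / X.sum(axis=1)
r = np.eye(K)[(share3 > np.median(share3)).astype(int)]
for _ in range(100):
    # M-step: mixture weights and component shot profiles
    pi = r.mean(axis=0)
    theta = r.T @ X
    theta /= theta.sum(axis=1, keepdims=True)
    # E-step: posterior responsibility of each component per player
    log_r = np.log(pi + eps) + X @ np.log(theta + eps).T
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

print(np.round(theta, 2))  # recovered profiles approximate `styles`
```

Cluster membership (the responsibilities `r`) is what lets the aggregate three-point trend be decomposed into specialist and all-around segments.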
'An Empirical Analysis On Disparate Impacts of the London 2012 Olympics'
Citadel Data Open: First Place
The London Olympics was certainly a mega event, and while the overall economic impact of the Games has been studied extensively, there is little literature specifically examining how the Olympics affected the more intangible aspects of London's various boroughs. Using an innovative causal-inference methodology, geographical analysis, and ML attribution, we found that the host borough obtained only short-term gains from the Olympics, whereas the real impact was more likely a multiplier for the already well-off and fast-developing parts of London (centered in West London) that could take advantage of the intangibles offered by the event (e.g., increased awareness and better infrastructure).
'A Driver Lifetime Value Report'
Lyft Data Challenge: Second Place
The paper applied a hybrid of a BG-NBD customer lifetime value model and KNN clustering to estimate the lifetime value of Lyft drivers. It found three distinct driver clusters with differing CLVs, offering a clear segmentation and targeting scheme for the company, and provided strategic recommendations in line with this insight.
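A minimal sketch of the BG-NBD data-generating story underlying such a CLV model; the parameters and observation window are illustrative, not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(2)

# BG-NBD story: each driver transacts at a Poisson rate
# lambda ~ Gamma(r, alpha) and churns after any given transaction
# with probability p ~ Beta(a, b). All values are illustrative.
r, alpha, a, b = 2.0, 4.0, 1.0, 4.0
n, horizon = 10_000, 52.0          # drivers, observation window (weeks)

lam = rng.gamma(r, 1.0 / alpha, size=n)
p = rng.beta(a, b, size=n)

counts = np.zeros(n, dtype=int)
for i in range(n):
    t = rng.exponential(1.0 / lam[i])   # time of first ride
    while t < horizon:
        counts[i] += 1
        if rng.random() < p[i]:         # driver churns after this ride
            break
        t += rng.exponential(1.0 / lam[i])

# Heterogeneous counts like these are what the fitted model summarizes
# into per-driver expected lifetime values.
print(counts.mean())
```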
'Economics of COVID-19 Lockdowns: Optimizing the Lockdown Health Economy Tradeoff'
Wharton Hackathon: First Place
The COVID-19 pandemic has forced many countries to use lockdowns as a public health measure to prevent further spread of the disease. Contrary to popular belief, this study showed at a theoretical level that lockdown efficiency is not a purely additive function of length and stringency: neither lockdown length nor stringency has a positive linear relationship with improved health outcomes. Instead, the best approach to an efficient lockdown is often to combine the right length with a lesser emphasis on stringency. Furthermore, we found that lockdown length and stringency do not drastically affect the economic status of states, owing to the rise of other economic opportunities.
'Disrupting San Francisco with the Power of Data'
Facebook Data Challenge: First Place
Using public industry-breakdown data for San Francisco, our team created three distinct metrics: (1) baseline size, (2) scalability, and (3) demand satisfaction, to evaluate which sector is most profitable in the long term. Using a funneling-down approach, we recommended the restaurant industry for its ability to transcend geographical boundaries within the city. In line with the analysis, our team also offered direct product recommendations that leverage these data points to improve Facebook's products as a whole.
'Two Step Property Pricing Model'
Brown Datathon: First Place
Our goal is to make predictions despite noise and randomness; this overarching strategy is foundational to our success. In particular, our engineered features tell our story: every aggregated observable is motivated by a macroeconomic element that affects individuals' underlying propensity to buy houses. To synthesize and generalize despite the noisy granular data, we built two independent yet related models: first, a classification model regressing on non-zero predicted zip5s, and second, an allocation model that breaks the prediction down to the zip9 grain for our final predictions.
'Game Recommendation: A Story-Driven Approach'
EA Datathon: Third Place
We believe that a user’s decision to play a game is not a random process, but one prompted by certain fixed and traceable drivers. A good recommendation engine should not overlook these characteristics but actively account for them. We propose that each individualized metric (popularity, social network effects, etc.) can be seen as a mini recommendation engine in its own right. However, to achieve a truly customer-centric product, the final model must be an encompassing one that accounts for each metric, performs considerably well, and handles edge cases with grace.
'A Probability Modelling Approach To Assess COVID-19: An Instacart Case Study'
STAT 476 Final Project: Top Ranked of the Class
The COVID-19 global pandemic caused a shock to the world, with an impact extending from the front lines of a battle for global health to daily activities formerly taken for granted. In particular, the subsequent policies and self-isolation measures adopted by much of the population in the face of COVID-19 brought about an unprecedented change in consumer behavior and in demand for online food and grocery delivery software. Using a finite mixture exponential-gamma model with a leaky logistic population extender, the paper explored the impact of COVID-19 on the popular food delivery service Instacart.
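For context, the exponential-gamma building block has a closed-form population survival curve: if individual lifetimes are exponential with rate λ and λ ~ Gamma(r, α) across the population, then S(t) = (α / (α + t))^r. A minimal sketch checking this against simulation (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def eg_survival(t, r, alpha):
    """Population survival when lifetimes are exponential with rate
    lambda and lambda ~ Gamma(r, alpha): S(t) = (alpha/(alpha+t))**r."""
    return (alpha / (alpha + t)) ** r

# Check the closed form against direct simulation.
r, alpha, t = 1.5, 3.0, 4.0
lam = rng.gamma(r, 1.0 / alpha, size=200_000)
lifetimes = rng.exponential(1.0 / lam)
print(eg_survival(t, r, alpha), (lifetimes > t).mean())
```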
'It Is All About Who: A Data Analysis on Loan Quality and Default'
STAT 471 Midterm Project; Grade: 16/15
The heightened demand comes with its unique challenges. Lending Club, a fast-growing and scaling company, faces the prominent problem of balancing its reach (continually serving a massive span of individuals) while maintaining the quality of its product. It is paramount for Lending Club to ensure the quality of the loans in its portfolio without compromising and denying too much service. In this paper I take the lens of a Lending Club executive, ultimately optimizing the company's loan selection using data science tools.
STAT 471 Final Project; Grade: 27/20
In this paper, our team seeks to explore a new way to approximate the best price for a property using predictive tools. Our objective is a model that can output an “expected market value” for a property based on its features and fields (physical or non-physical). We believe this generalized approximation can be achieved by training the model on open-market data and the price points of past openly traded properties. We see the “expected market value” not only as a reference point for the value of the house, but also as an opportunity to undercut the market, thereby gaining a competitive advantage over competitors.
'Class Review: A Statistical Machine Learning Approach to Penn Course Ratings'
CIS 545 Final Project; Grade: 100/100, chosen as a class example in subsequent iterations of the course
This project was conducted with the aim of gaining a better understanding of the end-of-semester course reviews that Penn students give. Utilizing textual data from course syllabi and descriptions, demographic data on faculty members, and logistical information on courses, we (1) built models to predict the ratings that courses receive, (2) tested hypotheses on factors influencing, and biases in, ratings, and (3) clustered courses based on their descriptions.
'Stroke Lesion Segmentation with 3D Convolutional Neural Networks'
CIS 522 Final Project; Grade: 540/550
The uneven shape and location of lesions, coupled with the variance in lesion size between stroke patients, pose major obstacles to the segmentation process. Unlike most image-processing tasks, which deal primarily with 2D images, our project handles 3D MRI scans; we therefore implemented a 3D convolutional neural network with five convolutional layers and two fully connected layers. Finally, when processing each voxel, we also feed in the symmetric voxel, in order to exploit the quasi-symmetry of the brain for our segmentation. The model from our paper outperformed conventional ML approaches and its 2D CNN counterpart.
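The symmetric-voxel input can be sketched in a few lines: each voxel's intensity is paired with that of its mirror image across the left-right midline as a second input channel (a minimal numpy illustration; the axis choice and shapes are assumptions, not the project's actual preprocessing):

```python
import numpy as np

# Toy "MRI" volume with shape (depth, height, width); the width axis
# is taken as the brain's left-right axis.
vol = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# Mirror the volume across the midline, then stack original and
# mirrored intensities as two channels, so each voxel is paired with
# its contralateral counterpart.
mirrored = vol[:, :, ::-1]
channels = np.stack([vol, mirrored])   # shape (2, depth, height, width)

print(channels.shape)
```

The stacked array is what a 3D CNN would consume as a two-channel input volume.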