Students in the Capstone Research course apply skills such as machine learning, statistics, data management, and visualization to solve real-world problems.

Founded by IACS Scientific Program Director Pavlos Protopapas, the Capstone Research course is a group-based research experience in which students work directly with a partner from industry, government, academia, or an NGO to solve a real-world data science problem. Students propose a solution in the form of a software package, a set of recommendations in a report, or a research paper. Upon completion of this challenging project, students will be better equipped to conduct research and enter the professional world. For information about becoming a Capstone Project partner, please contact Cathy Chute, IACS Executive Director.

Fall 2019 Projects

Back-Translation for Named Entity Recognition

Partner: Kensho


The original question of our project was whether we could incorporate information from a knowledge base such as WikiData to improve performance on NER.

We explore several methods for constructing type-specific vocabularies compiled from the knowledge base and show that compiling and cleaning this data is non-trivial. We then explore several methods of incorporating these vocabularies to learn an NER classifier trained on Wikipedia articles in a weakly supervised way. We demonstrate the challenges of incorporating non-contextual information in a setting where context is key. Lastly, we show how we can incorporate ideas from low-resource neural machine translation to improve the generalizability of NER classification.

Building An Image Recommendation System For News Articles using Word and Sentence Embeddings

Partner: Associated Press


Working in collaboration with the Associated Press (AP), this capstone group built a text-to-image recommendation system that recommends a set of images based on headline captions.

Since machine learning methods cannot operate on raw text directly, the team converted text to a numerical representation using word embeddings, which represent each word as a vector of numbers.
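To illustrate the idea, here is a minimal sketch of how word vectors could be used to match a headline to image captions. It is not the team's implementation: the tiny hand-made vectors in `TOY_VECTORS`, the word-averaging in `embed`, and the `recommend` helper are all assumptions made for this example. A real system would load pretrained word or sentence embeddings.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# A real system would load pretrained embeddings with hundreds of dimensions.
TOY_VECTORS = {
    "storm": [0.9, 0.1, 0.0],
    "flood": [0.8, 0.2, 0.1],
    "coast": [0.7, 0.1, 0.3],
    "election": [0.0, 0.9, 0.1],
    "vote": [0.1, 0.8, 0.0],
    "senate": [0.0, 0.9, 0.2],
}


def embed(text):
    """Average the vectors of all known words to get one vector for the text."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]  # no known words: fall back to the zero vector
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(headline, captions, top_k=1):
    """Return the top_k captions whose embeddings are closest to the headline."""
    query = embed(headline)
    ranked = sorted(captions, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_k]


captions = ["flood on the coast", "senate vote"]
print(recommend("storm hits coast", captions))
```

With these toy vectors, a weather headline ranks the flood caption first even though the two share only one word, because "storm" and "flood" sit close together in the vector space.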

Computer Vision for Automatic Road Damage Detection

Partner: Lab1886


Deteriorating roads plague areas with highly volatile weather and budgetary constraints. It’s a constant challenge for municipal governments to keep ahead of the wear and tear as they catalogue and target hot spots to fix.

In the U.S., most states only employ semi-automated methods for keeping track of road damage, and in other parts of the world, the process is completely manual or forgone altogether. The costly and time-consuming procedure for collecting these data is only compounded by the fact that it must be done with relatively high frequency to ensure the data are up to date. This raises the question: can computer vision help?

Machine Learning for Urban Planning: Estimating Parking Capacity

Partner: City of Somerville, MA


If everything continues as planned, Somerville, Massachusetts, a city just outside of Boston, will be getting a new subway line in 2021. Though the new line is exciting, it may cause issues for the existing citywide resident on-street parking program.

To address transportation planning questions, Somerville is conducting an audit of their parking supply. They have a good estimate of on-street parking capacity, but they have much less data about off-street parking. Their question is deceptively simple: how many residential units in Somerville have off-street parking?

Named Entity Disambiguation Boosted with Knowledge Graphs

Partner: Kensho


Named Entity Disambiguation (NED), or Named Entity Linking, is a natural language processing (NLP) task which assigns a unique identity to entities mentioned in text.

This can be helpful in text analysis. For example, a financial company may want to identify all companies mentioned within a news article, and subsequently investigate how the relations between the companies might affect the markets.

Optimal Real-time Scheduling for Black Hole Imaging

Partner: Event Horizon Telescope


In April 2019, the Event Horizon Telescope (EHT) Collaboration released the first image of a black hole.

To accomplish this, the EHT used radio dishes across the globe simultaneously recording radio waves from near the black hole, synchronized by Global Positioning System (GPS) timing and referenced to atomic clocks for stability. EHT observations typically take place during a 10-12 day window, within which 5-6 days are triggered for observing when conditions are optimal. This project's goal is to use machine learning and/or prediction methods to help the EHT determine which nights should be triggered for global observations. This is an opportunity for students to work with EHT scientists and engineers on various aspects of black hole science in order to assess the probability that observations will lead to breakthrough results.

Spotify Challenge: Offline Recommender System

Partner: Spotify


One of the main challenges for Spotify is to recommend the right music to each user. Users' satisfaction can be monitored based on whether they skip the recommendation.

Therefore, the goal of a good recommender system is to show users content they like and to minimize the probability that they will skip a song. In this project, we present the problem of sequential music recommendations.

The Need for Efficient Neural Architecture Search (NAS)

Partner: Google

Deep learning frees us from feature engineering, but creates a new problem of “architecture engineering.”

Numerous neural network architectures have been invented, but the design of architectures often feels more like an art than a science. In this project, we investigate an efficient gradient-based search method called DARTS (Differentiable Architecture Search).

DARTS has been shown to require roughly 100x fewer GPU hours than previous methods like NASNet and AmoebaNet, and is competitive with the ENAS approach from Google Brain. We will compare DARTS to random search and state-of-the-art, hand-designed architectures such as ResNet.
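The core idea that makes DARTS gradient-based is a continuous relaxation: instead of picking one operation per connection, each connection computes a softmax-weighted mixture of all candidate operations, and the mixing weights become learnable architecture parameters. The sketch below shows only that mixing step and a simplified final selection. It is not the project's code. The scalar operations in `CANDIDATE_OPS` are hypothetical stand-ins for the convolution and pooling operations used in practice, and `mixed_op` and `discretize` are names chosen for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations acting on a single number.
# Real search spaces use convolutions, pooling, skip connections, and so on.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Continuous relaxation: softmax(alpha)-weighted sum of all candidate ops."""
    weights = softmax([alphas[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas):
    """Simplified final step: keep the operation with the largest alpha."""
    return max(alphas, key=alphas.get)


# Equal alphas give each operation a weight of 1/3.
uniform = {"identity": 0.0, "double": 0.0, "zero": 0.0}
print(mixed_op(3.0, uniform))

# After (hypothetical) training, the strongest operation is selected.
trained = {"identity": 0.1, "double": 2.0, "zero": -1.0}
print(discretize(trained))
```

Because `mixed_op` is differentiable in the alphas, they can be updated by gradient descent alongside the network weights, which is what removes the need to train each candidate architecture separately. The selection step here is deliberately simplified and ignores details such as how many incoming connections each node keeps.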