Capstone Project Course: AC297r

Harvard IACS -- Capstone Research Project Course

Students in the Capstone course apply skills such as machine learning, statistics, data management, and visualization to solve real-world problems. In groups of three to four, students identify a complex and open-ended problem and work with the instructor, mentors, and industry partners to propose a solution in the form of a software package, a set of recommendations in a report, or a research paper.  Upon completion of this challenging project, students will be better equipped to conduct research and enter the professional world.

The Capstone course is available to all Harvard students. Enrolled and prospective students, please visit the current Capstone course site.

To learn more about our partner criteria, visit our Partners page.


An Attempt to Improve the MBTA Through Data

The MBTA serves 4.8 million people throughout the Boston metro area and facilitates approximately 1.3 million trips each weekday. Aggregated entry and exit data is collected for each rail station at 15-minute intervals. Since commuting is one of the most habitual acts a metropolitan citizen performs, this data provides an excellent means of predicting ridership throughout the week.
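As a rough illustration of how habitual 15-minute station counts can support prediction, the sketch below averages historical entries by weekday and 15-minute slot and uses that average as a forecast. The record layout, function names, and numbers are hypothetical, and this simple seasonal baseline is an assumption for illustration, not the method the student team used.

```python
from collections import defaultdict
from datetime import datetime


def build_baseline(records):
    """Average entries per (weekday, hour, 15-minute slot).

    records: iterable of (timestamp, entries) pairs for one station.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of entries, count]
    for ts, entries in records:
        key = (ts.weekday(), ts.hour, ts.minute // 15)
        totals[key][0] += entries
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def predict(baseline, ts):
    """Return the historical average for the slot containing ts, or None."""
    return baseline.get((ts.weekday(), ts.hour, ts.minute // 15))


# Hypothetical history for a single station (values invented for illustration).
history = [
    (datetime(2015, 3, 2, 8, 0), 120),  # Monday, 08:00 slot
    (datetime(2015, 3, 9, 8, 0), 140),  # Monday, 08:00 slot
    (datetime(2015, 3, 3, 8, 0), 90),   # Tuesday, 08:00 slot
]
baseline = build_baseline(history)
forecast = predict(baseline, datetime(2015, 3, 16, 8, 5))  # a later Monday
```

A slot with no history returns None, which makes gaps in coverage explicit rather than silently predicting zero riders.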

Students: Aaron Zampaglione, Filip Piasevoli, Lyla Fadden, Micah Lanier

See also: MBTA, 2015

Analyzing unfulfilled query data in Tripadvisor

TripAdvisor is one of the largest travel websites built on a user-content model, with nearly 50 million monthly visitors and millions of business reviews. We aim to improve the search experience by learning fine-grained information about each business, namely the users' sentiments toward specific entities as expressed in their reviews. Our approach allows us to rank these sentiments and implement a novel search experience where results are sorted by the intensity of sentiment toward the item of interest. These search results can be further enriched by displaying other positively (or negatively) mentioned items to the users. In addition, we present results from clustering businesses based on reviews. Our experiments show that these clusters are potentially useful for further optimization of the search engine.

Check out the project demo website here.



See also: Tripadvisor, 2017

Boston Globe Subscriber Conversion

The typical cyber-life of a Boston Globe user starts with anonymous visits and progresses from casually visiting the site to ultimately becoming a subscriber. The Boston Globe would like to understand the idiosyncrasies and patterns of a subscriber and use that knowledge to increase subscription conversion rates.

Creating a better revenue model for MBTA

The Massachusetts Bay Transportation Authority, a.k.a. MBTA, is the public transit agency operating most transit in the Greater Boston area, including buses, subways, and trains. The MBTA operates with high-level averages of revenue data, but does not have access to a detailed model of fares across different routes, times and dates, modes of transit, passenger profiles, and other characteristics. The goal of this project is to create a more granular revenue model using existing passenger transaction data.

Such a model can be used to analyze bus route efficiency in greater detail than is currently possible, and then enable further exploration. We've received an initial data set of ~275 million "boardings" for MBTA subway and bus trips taken during the 2016 calendar year. Based on this dataset, and schedule information obtained from the MBTA’s public GTFS API, we’ve completed some initial data exploration and built an initial revenue model.

Data Collection, Management and Cleaning

The City of Como project is a collaboration with Fluxedo, an Italian startup working in partnership with the municipality of Como to model human dynamic flow in the city.  The overall aim of the project is to integrate multiple and diverse data sources to build a picture of the way people live and move around the city. Using historical telecom and social media data along with other geolocated data, the team will form a coherent picture of the daily movements of different demographic groups throughout Como, dependent on the day, time, and other factors such as weather and events.

The end result will be an interactive visualization for visitors and residents to generate crowdsourced recommendations of how to spend their time in the city.

The Como project will focus on the city of Como, a small medieval town beautifully located on Lake Como in Northern Italy, with a large walking area in the downtown district and along the lakeshore. The project consists of collecting and analyzing data about the city and the way people live and move in it by integrating multiple and diverse data sources. The problems to be addressed are:

  • Providing a reliable estimate of the overall picture of people density
  • Predicting the impact of future events positioned in time and space
  • Given a constrained budget and a cost model for sensors deployment

Dynamic Factor Selection for Determining Market Exposure

Market exposure is a key concept in quantitative finance. It is classically measured by estimating a beta coefficient in a linear regression of a strategy's or asset's returns on the returns of the market, where beta is the exposure. Returns with low exposure to the market are desired, as they are less affected by market downturns. This exposure modeling can be generalized to multiple factors, and the exposures to those factors are used to determine whether a strategy or asset is sufficiently protected from changes in certain risk factors, and to purchase hedges that cancel out this risk exposure.
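As a minimal sketch of the classical measurement described above, the snippet below estimates a single-factor beta as the covariance of asset and market returns divided by the variance of market returns, which is the ordinary least-squares slope. The function name and the return series are hypothetical and serve only to illustrate the calculation; they do not come from the project.

```python
from statistics import mean


def estimate_beta(asset_returns, market_returns):
    """OLS slope of asset returns regressed on market returns."""
    if len(asset_returns) != len(market_returns) or len(asset_returns) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    a_bar = mean(asset_returns)
    m_bar = mean(market_returns)
    # Sum of cross-deviations and sum of squared market deviations;
    # the shared 1/(n-1) factor cancels in the ratio.
    cov = sum((a - a_bar) * (m - m_bar)
              for a, m in zip(asset_returns, market_returns))
    var = sum((m - m_bar) ** 2 for m in market_returns)
    if var == 0:
        raise ValueError("market returns have zero variance")
    return cov / var


# Hypothetical daily returns, invented for illustration.
market = [0.01, -0.02, 0.015, 0.0, -0.005]
asset = [0.5 * m + 0.001 for m in market]  # constructed with a true beta of 0.5
beta = estimate_beta(asset, market)
```

In the multi-factor generalization, the single slope would be replaced by a vector of coefficients from a multiple regression, one per risk factor.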

Student: Delaney Granizo-Mackenzie 

See also: Quantopian, 2016

Image emotion classification in social media websites

Automatic image emotion classification is challenging because it requires models capable of recognizing emotion content in images, which can vary substantially. In addition, there was no image dataset with high-quality labels large enough for learning these models until 2016. We have designed the System for Emotion Data Management and Analysis (SEDMA) not only to predict image emotion but also to actively improve the process of building high-quality manually labeled datasets. SEDMA can potentially be used in a wide range of applications, from automatic emotion recognition in smart devices to social media marketing decision-making. By using only 500 cinema-related images to fine-tune a pre-trained deep learning model (Residual Network), a 59.1% top-2 class accuracy out of 8 classes was achieved through collaboration with Legendary Applied Analytics.
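For readers unfamiliar with the metric, top-2 accuracy counts a prediction as correct when the true class is among the two highest-scoring classes. The sketch below shows the calculation on made-up scores; the function name and the data are hypothetical and unrelated to SEDMA's actual outputs.

```python
def top_k_accuracy(scores, labels, k=2):
    """Fraction of samples whose true label is among the k top-scoring classes.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 3 images over 4 classes (invented for illustration).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 2 is second-highest: top-2 hit
    [0.7, 0.1, 0.1, 0.1],  # true class 0 is highest: hit
    [0.3, 0.4, 0.2, 0.1],  # true class 3 is lowest: miss
]
labels = [2, 0, 3]
top2 = top_k_accuracy(scores, labels, k=2)
top1 = top_k_accuracy(scores, labels, k=1)
```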


Machine learning-assisted medical image annotation

Machine learning has emerged in recent years as a powerful tool for many tasks across a wide range of disciplines. This has held true in biomedical imaging, where machine learning-based technologies have the potential to improve the efficiency and accuracy of imaging specialists by automatically identifying and measuring key findings within image data. Unfortunately, those automatic tools do not exist yet, and manual annotation remains the common, time-consuming standard. The purpose of this project is to develop a medical image annotation tool that will allow researchers to label medical imaging data easily and to predict annotations in an automated fashion.

Check out the project demo website here.

Negotiation Tool for Airbnb

Airbnb is a global marketplace for apartment rentals that reaches 190 countries and 34,000 cities. On Airbnb, people post rental offers and rent their own apartments to others, thereby creating a market parallel to traditional hotel-based offers. We propose to integrate data from Airbnb with data from other sources, including open data, census information, real estate data, information about the district and the house interiors, and social sources such as Instagram and Twitter, so as to develop a new scoring system for Airbnb offers, similar to the hotel star system.

Students: Jack Qian, Qing Zhao, Giovanni Battista, Michele Inverizzi 

Nester for Design

Nester is a platform where companies can find the best designs for a project, using Kaggle. Kaggle is a platform that hosts machine learning competitions where companies and researchers post data and pose challenges. Data scientists from all over the world compete to answer the questions and to produce the best results, in effect, crowdsourcing the most efficient technique or solution to the questions.

Through Nester, companies post brief design challenges. Designers then propose solutions and vote for other people's projects. Experts refine projects. Companies give feedback, refine, and select the best ideas. Finally, users vote for their favorite product and we have a winner!


Pipeline for Identifying Potentially Hazardous Asteroids

Potentially hazardous objects (PHOs) are currently defined based on parameters that measure the object's potential to make threatening and close approaches to the Earth. To be considered a PHO, objects generally have an Earth minimum orbit intersection distance (MOID) of 0.05 AU or less and an absolute magnitude (H) of 22.0 or brighter (a rough indicator of large size). In collaboration with the Harvard-Smithsonian Center for Astrophysics, the goal of this project is to develop the full pipeline which includes data management, algorithmic development and probabilistic predictions of impact.
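The two thresholds above can be expressed as a simple filter. The sketch below is only an illustration of that definition, with a hypothetical function name and made-up object parameters; it is not part of the project's pipeline, which also covers data management and probabilistic impact prediction.

```python
MOID_LIMIT_AU = 0.05  # Earth minimum orbit intersection distance threshold
H_LIMIT_MAG = 22.0    # absolute magnitude threshold


def is_potentially_hazardous(moid_au, abs_magnitude):
    """Apply the general PHO criteria: MOID <= 0.05 AU and H <= 22.0.

    A smaller absolute magnitude means a brighter (roughly, larger) object,
    so "22.0 or brighter" translates to H <= 22.0.
    """
    return moid_au <= MOID_LIMIT_AU and abs_magnitude <= H_LIMIT_MAG


# Hypothetical objects as (MOID in AU, H), invented for illustration.
close_and_bright = is_potentially_hazardous(0.03, 19.5)  # meets both criteria
close_but_faint = is_potentially_hazardous(0.03, 24.0)   # too faint
bright_but_far = is_potentially_hazardous(0.20, 18.0)    # orbit too distant
```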

Students: Matt Holman, Michael Lackner

Power of Words: Lyric-based music recommendation

The goal of this project is to leverage the rich content of song lyrics to connect each song with relatable concepts such as moods, occasions, and themes. A direct application of this automatic tagging system would be to produce playlists that are associated with different emotions or that serve specific purposes (after break-up songs, holiday music, party mix, et cetera). An initial target for the final product would be a collection of moods and topics that a user can select to retrieve an associated list of songs. A more advanced version would allow the user to type in a specific emotion or adjective and listen to a list of related songs. The ultimate goal is to help create an interactive and highly personalized music experience for the users.

If time permits, we might be able to extend the project further in either modeling or research directions. A modeling enhancement would be to not only process lyrics but also take into consideration other characteristics of songs such as genre, vintage, writer, singer, et cetera, when making connections between them. A research direction of interest, on the other hand, would be to analyze and/or visualize lyrical themes across time. Overall, depending on data accessibility and quality, we see much potential in this project and aim to explore various options along the way, with the end goal of producing a personalized music experience for users in mind.

Check out the project demo website here.

See also: Spotify, 2017

Predicting Alzheimer's Disease

Alzheimer’s Disease (AD) ravages the cognitive ability of more than 5 million Americans and creates an enormous strain on the health care system. Our research explores prediction of AD without medical imaging, in hopes of earlier and cheaper diagnoses. We construct a classification pipeline which shows greater than 90% accuracy and recall in predicting AD with our best model. This model generalizes well to sub-studies of our main data set, ADNI, as well as another AD dataset, AIBL. We also find that we can get close to 79% accuracy with only one clinical visit of data. Finally, we produce a meta-classification algorithm which balances feature cost with accuracy. This work can be adapted into a diagnostic tool for maximizing accuracy while minimizing the number of tests a patient needs to take for diagnosis.

See also: Biogen, 2017

Restaurant Photo Classification Algorithm and Business Viability Tool

TripAdvisor users write reviews and upload photos from their various restaurant visits. These photos can be categorized and analyzed to reveal information about the restaurant's menu, dishes, pricing, etc. The first step in this analysis is the classification of photos into simple, broad groups: food, drinks, menus, and inside and outside photos of the establishment. The students' goal for this project was to build an image classifier using Convolutional Neural Networks and images acquired by the students themselves.

See also: Tripadvisor, 2016

Sentiment Analysis and Predictive Models

Moleskine’s philosophy is culture, travel, memory, imagination and personal identity. The goal of this project is to find influencers by looking at users' interactions and to target them across different social platforms. For example, we will look at how people connect on Twitter and create a weighted graph using both following numbers and @mentions. We will look at all platforms and cluster groups of posts by trending topics using LDA. This can be applied to all sources of media. We will then try to identify whether trending topics and influencers are common across social platforms.

Also, we will apply reverse engineering using the Leuchtturm case. What were the influencers doing before being sponsored by the company? The project will explore the popularity and success of different Moleskine products co-branded with other famous brands (also known as special editions) and launched during specific periods of time. The main field of analysis is measuring the impact of different products on social media channels and correlating that to sales.

See also: Moleskine, 2017

Social media engagement for cosmetic brands

Tribe Dynamics is a San Francisco-based startup that measures social media engagement for cosmetic brands. Online content creation led by beauty bloggers is one of the key predictors of offline revenue in this industry. This project focuses on investigating how hashtag usage spreads across a social network of Instagrammers who post about beauty products. The goal of the project is to build a probabilistic model of each person’s propensity to use a hashtag based on whether their friends also use the hashtag, and to determine the characteristics of a successful marketing campaign using hashtags.

See also: Tribe Dynamics, 2017

Spotify playlist prediction

Spotify is a music, podcast, and video streaming service with 100 million active users. The company curates playlists that are followed by millions of users. These playlists are created by a combination of algorithmic and human-driven processes. The aim of our project is to make use of machine learning algorithms to improve the effectiveness of algorithmically curated playlists and to analyze what audio features contribute to the popularity of playlists.

Spotify attempts to direct the most relevant songs to users based on their preferences, moods, etc. An enhanced version of our project would include generation of user-specific playlists based on genre, mood, etc. The success of a playlist depends on certain features which need to be determined. We are using two datasets for our project. The key dataset is the set of audio features of tracks and playlists obtained from the Spotify API. Additional features can be added from the Million Song Dataset, a freely available collection of audio features and metadata about 1 million popular tracks. Some of the important audio features are loudness, energy, danceability, and beats per minute. An extended, time-series version of these audio features can be obtained by processing 30 seconds of raw audio obtained using the Spotify API, which will be targeted in a later phase of the project.

Check out the project demo website here.

See also: Spotify, 2017

Stochastic Query Optimization and Bias Characterization for Large Scale Text Search

Legendary is a leading film production company, with 43 feature films released, 6 films currently in production and $13 billion in box office revenues in 2015. Identifying the correct search terms to find social media posts about an entity or concept is a highly challenging task. For instance, the word Fargo may refer to a place (in North Dakota), a TV show, a movie, or a bank (Wells Fargo). The student team analyzed 4 million tweets to produce a text-query generation and optimization system. The search index query, constructed from combinations of text tokens constrained to simple logical operators, returns a highly pure set of text documents relevant to a property, such as a film, and also provides a characterization of the query bias.
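To make the idea of query purity concrete, the sketch below evaluates two token queries built from simple logical operators (all-of, any-of, none-of) against a handful of labeled posts and reports the fraction of returned posts that are relevant. The function names, the query representation, and the example posts are all hypothetical; the team's actual system and its stochastic optimization are not reproduced here.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """True if the text satisfies the AND / OR / NOT token constraints."""
    tokens = set(text.lower().split())
    if not all(t in tokens for t in all_of):
        return False
    if any_of and not any(t in tokens for t in any_of):
        return False
    return not any(t in tokens for t in none_of)


def purity(docs, **query):
    """Fraction of returned documents that are labeled relevant."""
    returned = [relevant for text, relevant in docs if matches(text, **query)]
    return sum(returned) / len(returned) if returned else 0.0


# Hypothetical labeled posts; True means the post is about the film.
docs = [
    ("watching fargo the movie tonight", True),
    ("fargo season two is a great show", False),
    ("wells fargo bank raised fees", False),
    ("driving through fargo north dakota", False),
    ("the fargo film is a classic", True),
]

broad = purity(docs, all_of=("fargo",))
refined = purity(docs, all_of=("fargo",), any_of=("movie", "film"))
```

On this toy data, the broad query returns every post mentioning the ambiguous token, while the refined query trades coverage for purity, which is the tension a query optimizer has to manage.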