Student Research Projects

Spring 2017 Projects

A/B Testing and Predictive Models

“Amyloid positivity” is a key risk indicator of Alzheimer’s disease. Amyloid status is considered positive when Amyloid Beta (Aβ) protein, also referred to as amyloid plaque, has accumulated in the brain with sufficient density to meet a threshold. The goal of this capstone project is to use machine learning and other advanced analytics approaches to construct a model that predicts whether a single individual is amyloid positive or negative. The potential of this project is that its deliverables could be integrated into Biogen’s Alzheimer’s treatment pipeline.

See also: Biogen, 2017

Analyzing unfulfilled query data in Tripadvisor

On Tripadvisor, customers can search for anything from restaurants and tour guides to flights and hotel bookings. The website surfaces related reviews, pictures, ratings, and suggestions from other visitors. The client also supports online booking, allowing customers to book their travel with one click. Our project is to implement an online clustering algorithm for review data written by customers. At regular intervals, we would like to classify new clusters of those reviews into categories and connect them with sentiment analysis, e.g. to identify the items that customers liked the most or complained about the most.

Currently, the client still faces some challenges. One problem posed to our team is how to make better use of the customer reviews that have been collected. By analyzing this data, the company hopes to better understand business providers’ services and thereby improve the accuracy of the website’s recommendations. Furthermore, understanding customers’ needs will help TripAdvisor improve its service and discover hidden potential markets in certain areas. Our team hopes to meet this challenge.

So far, our team has reorganized the data provided by the company and the data scraped online. Given the large amount of data available, the team is currently focusing on hotels and restaurants in the Boston area. Using packages available online, the team has been applying sentiment analysis and topic models to this data. In the meantime, our team is also exploring other tools mentioned in the recent literature.




See also: Tripadvisor, 2017

Creating a better revenue model for MBTA

The Massachusetts Bay Transportation Authority, a.k.a. the MBTA, is the public transit agency operating most transit in the Greater Boston area, including buses, subways, and trains. The MBTA operates with high-level averages of revenue data, but does not have access to a detailed model of fares across different routes, times and dates, modes of transit, passenger profiles, and other characteristics. The goal of this project is to create a more granular revenue model using existing passenger transaction data.

Such a model can be used to analyze bus route efficiency in greater detail than is currently possible, and then enable further exploration. We've received an initial data set of ~275 million "boardings" for MBTA subway and bus trips taken during the 2016 calendar year. Based on this dataset, and schedule information obtained from the MBTA’s public GTFS API, we’ve completed some initial data exploration and built an initial revenue model.

Data Collection, Management and Cleaning

The City of Como project is a collaboration with Fluxedo, an Italian startup working in partnership with the municipality of Como to model human dynamic flow in the city.  The overall aim of the project is to integrate multiple and diverse data sources to build a picture of the way people live and move around the city. Using historical telecom and social media data along with other geolocated data, the team will form a coherent picture of the daily movements of different demographic groups throughout Como, dependent on the day, time, and other factors such as weather and events.

The end result will be an interactive visualization for visitors and residents to generate crowdsourced recommendations of how to spend their time in the city.

The Como project will focus on the city of Como, a small medieval town beautifully located on Lake Como in Northern Italy, with a large walking area in the downtown district and along the lakeshore. The project consists of collecting and analyzing data about the city and the way people live and move in it by integrating multiple and diverse data sources. The problems to be addressed are:

  • Providing a reliable estimate of the overall picture of people density
  • Predicting the impact of future events positioned in time and space
  • Given a constrained budget and a cost model for sensors deployment

Image emotion classification in social media websites

Legendary Entertainment is a media company that produces blockbuster films. While advertising an upcoming film, they need to know which audience is responding to their ads and how. To this end, they scrape data from social media websites to monitor conversations about their ads, and they adjust their campaigns accordingly to maximise effectiveness. Today, users are increasingly using images to express emotions and feelings on social media.

To keep up with this trend, Legendary wants to be able to quickly interpret usage of these images relating to their films. In this project, the team will develop a tool to learn and be able to classify the sentiments expressed through images from social media about movies.


Machine learning-assisted medical image annotation

Machine learning has emerged in recent years as a powerful tool for many tasks across a wide range of disciplines. This has held true in biomedical imaging, where machine learning-based technologies have the potential to improve the efficiency and accuracy of imaging specialists by automatically identifying and measuring key findings within image data. Unfortunately, those automatic tools do not exist yet, and manual annotation is the common, time-consuming standard. The purpose of this project is to develop a medical image annotation tool that will allow researchers to label medical imaging data easily and to predict annotations automatically.

Power of Words: Lyric-based music recommendation

The goal of this project is to leverage the rich content of song lyrics to connect each song with relatable concepts such as moods, occasions, and themes. A direct application of this automatic tagging system would be to produce playlists that are associated with different emotions or that serve specific purposes (post-break-up songs, holiday music, party mix, et cetera). An initial target for the final product would be a collection of moods and topics that a user can select to retrieve an associated list of songs. A more advanced version would allow the user to type in a specific emotion or adjective and listen to a list of related songs. The ultimate goal is to help create an interactive and highly personalized music experience for users.

If time permits, we might be able to extend the project further in either a modeling or a research direction. A modeling enhancement would be to not only process lyrics but also take into consideration other characteristics of songs such as genre, vintage, writer, singer, et cetera, when making connections between them. A research direction of interest, on the other hand, would be to analyze and/or visualize lyrical themes across time. Overall, depending on data accessibility and quality, we see much potential in this project and aim to explore various options along the way, keeping in mind the end goal of producing a personalized music experience for users.




See also: Spotify, 2017

Sentiment Analysis and Predictive Models

Moleskine’s philosophy is culture, travel, memory, imagination and personal identity. The goal of this project is to find influencers by looking at users’ interactions and to target them across the different social platforms. For example, we will look at how people connect on Twitter and create a weighted graph using both following numbers and @mentions. We will look at all platforms and cluster groups of posts by trending topics using LDA. This can be applied to all sources of media. We will then try to identify whether trending topics and influencers are common across social platforms.
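As a minimal sketch of the weighted-graph idea, the Python below combines follow relationships and @mentions into directed edge weights and ranks accounts by total incoming weight. The weighting scheme (FOLLOW_WEIGHT, MENTION_WEIGHT), the function names, and the sample accounts are all assumptions made for illustration; the project description does not specify how the two signals are combined.

```python
from collections import defaultdict

FOLLOW_WEIGHT = 1.0    # assumed weight of a follow relationship
MENTION_WEIGHT = 0.5   # assumed weight of a single @mention


def build_graph(follows, mentions):
    """Return {(source, target): weight} for a directed interaction graph."""
    weights = defaultdict(float)
    for follower, followed in follows:
        weights[(follower, followed)] += FOLLOW_WEIGHT
    for author, mentioned in mentions:
        weights[(author, mentioned)] += MENTION_WEIGHT
    return dict(weights)


def rank_by_incoming_weight(weights):
    """Rank accounts by total incoming edge weight, highest first."""
    score = defaultdict(float)
    for (_, target), weight in weights.items():
        score[target] += weight
    return sorted(score.items(), key=lambda item: (-item[1], item[0]))


# Invented accounts and interactions, for illustration only.
follows = [("ana", "studio"), ("ben", "studio"), ("ana", "ben")]
mentions = [("ana", "studio"), ("ben", "studio"),
            ("ben", "studio"), ("studio", "ana")]

graph = build_graph(follows, mentions)
ranking = rank_by_incoming_weight(graph)
print(ranking)
```

Incoming weight is only one possible proxy for influence; a centrality measure such as PageRank could be substituted on the same edge weights.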

We will also reverse-engineer the Leuchtturm case: what were the influencers doing before being sponsored by the company? The project will explore the popularity and success of different Moleskine products co-branded with other famous brands (also known as special editions) and launched during specific periods of time. The main field of analysis is measuring the impact of different products on social media channels and correlating that impact with sales.


See also: Moleskine, 2017

Social media engagement for cosmetic brands

Tribe Dynamics is a San Francisco-based startup that measures social media engagement for cosmetic brands. Online content creation led by beauty bloggers is one of the key predictors of offline revenue in this industry. This project focuses on investigating how hashtag usage spreads across a social network of Instagrammers who post about beauty products. The goal of the project is to build a probabilistic model of each person’s propensity to use a hashtag based on whether their friends also use the hashtag, and to determine the characteristics of a successful marketing campaign using hashtags.

See also: Tribe Dynamics, 2017

Spotify playlist prediction

Spotify is a music, podcast, and video streaming service with 100 million active users. The company curates playlists that are followed by millions of users. These playlists are created by a combination of algorithmic and human-driven processes. The aim of our project is to make use of machine learning algorithms to improve the effectiveness of algorithmically curated playlists and to analyze what audio features contribute to the popularity of playlists.

Spotify attempts to direct the most relevant songs to users based on their preferences, moods, etc. An enhanced version of our project would generate user-specific playlists based on genre, mood, etc. The success of a playlist depends on certain features, which need to be determined. We are using two datasets for our project. The key dataset is the set of audio features of tracks and playlists obtained from the Spotify API. Additional features can be added from the Million Song Dataset, a freely available collection of audio features and metadata for a million popular tracks. Some of the important audio features are loudness, energy, danceability, and beats per minute. An extended, time-series version of these audio features can be obtained by processing 30-second raw audio clips obtained through the Spotify API, which will be targeted in a later phase of the project.

See also: Spotify, 2017

Past Projects

Negotiation Tool for Airbnb

Airbnb is a global marketplace for apartment rentals that reaches 190 countries and 34,000 cities. On Airbnb, residents post rental offers and rent their own apartments to others, creating a market parallel to traditional hotel-based offerings. We propose to integrate data from Airbnb with data from other sources, including open data, census information, real estate data, information about the district and the house interiors, and social sources such as Instagram and Twitter, so as to develop a new scoring system for Airbnb offers, similar to the hotel star system.

Students: Jack Qian, Qing Zhao, Giovanni Battista, Michele Inverizzi 

Nester for Design

Nester is a platform where companies can find the best designs for a project, using Kaggle. Kaggle is a platform that hosts machine learning competitions where companies and researchers post data and pose challenges. Data scientists from all over the world compete to answer the questions and to produce the best results, in effect, crowdsourcing the most efficient technique or solution to the questions.

Through Nester, companies post brief design challenges. Designers then propose solutions and vote for other people's projects. Experts refine projects. Companies give feedback, refine, and select the best ideas. Finally, users pledge for their favorite product and we have a winner!


Pipeline for Identifying Potentially Hazardous Asteroids

Potentially hazardous objects (PHOs) are currently defined based on parameters that measure the object's potential to make threatening and close approaches to the Earth. To be considered a PHO, objects generally have an Earth minimum orbit intersection distance (MOID) of 0.05 AU or less and an absolute magnitude (H) of 22.0 or brighter (a rough indicator of large size). In collaboration with the Harvard-Smithsonian Center for Astrophysics, the goal of this project is to develop the full pipeline which includes data management, algorithmic development and probabilistic predictions of impact.
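The two thresholds quoted above can be written as a simple filter. The Python below is a minimal sketch of that definition only, not of the project's pipeline; the function name and the sample objects are hypothetical. Because a smaller absolute magnitude means a brighter object, "22.0 or brighter" translates to H <= 22.0.

```python
# Thresholds quoted in the project description.
MOID_LIMIT_AU = 0.05   # Earth minimum orbit intersection distance, in AU
H_LIMIT = 22.0         # absolute magnitude; smaller H means a brighter object


def is_potentially_hazardous(moid_au: float, abs_magnitude: float) -> bool:
    """Return True when an object meets both PHO criteria."""
    return moid_au <= MOID_LIMIT_AU and abs_magnitude <= H_LIMIT


# Hypothetical objects, invented for illustration only.
candidates = {
    "object_a": (0.02, 19.5),   # close and bright: meets both criteria
    "object_b": (0.02, 24.0),   # close but too faint
    "object_c": (0.30, 18.0),   # bright but never comes close enough
}

flagged = [name for name, (moid, h) in candidates.items()
           if is_potentially_hazardous(moid, h)]
print(flagged)
```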

Students: Matt Holman, Michael Lackner

Stochastic Query Optimization and Bias Characterization for Large Scale Text Search

Legendary is a leading film production company, with 43 feature films released, 6 films currently in production, and 13 billion in box office revenue through 2015. Identifying the correct search terms to find social media posts about an entity or concept is a highly challenging task. For instance, the word Fargo may refer to a place (in North Dakota), a TV show, a movie, or a bank (Wells Fargo). The student team analysed 4 million tweets to produce a text-query generation and optimization system. The search index query, constructed from combinations of text tokens constrained to simple logical operators, returns a highly pure set of text documents relevant to a property, such as a film, and also provides a characterization of the query bias.

Students: Siv Lu, Chloe Liu, Zeling Qiu

Dynamic Factor Selection for Determining Market Exposure

Market exposure is a key concept in quantitative finance. It is classically measured by estimating a beta coefficient in a linear regression of the returns of an asset or strategy on the returns of the market, where beta (the exposure) expresses how strongly those returns move with the market. Returns with low exposure to the market are desired, as they are less affected by market downturns. This exposure modeling can be generalized to multiple factors, and the exposures to those factors are used to determine whether a strategy or asset is sufficiently protected from changes in certain risk factors, and to purchase hedges that cancel out this risk exposure.
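To make the single-factor case concrete, the Python below is a minimal sketch of an ordinary least-squares estimate of beta, computed as the covariance of asset and market returns divided by the variance of market returns. The function name and the return series are invented for illustration, and this is not presented as the method used in the project.

```python
def estimate_beta(asset_returns, market_returns):
    """Least-squares fit of r_asset = alpha + beta * r_market."""
    n = len(asset_returns)
    if n != len(market_returns) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    if var == 0:
        raise ValueError("market returns are constant; beta is undefined")
    beta = cov / var
    alpha = mean_a - beta * mean_m
    return alpha, beta


# Made-up daily returns, for illustration only.
market = [0.010, -0.020, 0.015, 0.005, -0.010]
asset = [0.001 + 0.5 * m for m in market]   # built with beta = 0.5 on purpose

alpha, beta = estimate_beta(asset, market)
print(round(alpha, 6), round(beta, 6))
```

With several factors, the same idea becomes a multiple regression with one coefficient (exposure) per factor.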

Student: Delaney Granizo-Mackenzie 

See also: Quantopian, 2016

Restaurant Photo Classification Algorithm and Business Viability Tool for Tripadvisor

TripAdvisor users write reviews and upload photos from their various restaurant visits. These photos can be categorized and analysed to reveal information about the restaurant's menu, dishes, pricing, etc. The first step in this analysis is the classification of photos into simple, broad groups: food, drinks, menus, and inside and outside photos of the establishment. The students' goal for this project was to build an image classifier using Convolutional Neural Networks and images acquired by the students themselves.

An Attempt to Improve the MBTA Through Data

The MBTA serves 4.8 million people throughout the Boston metro area and facilitates approximately 1.3 million trips each weekday. Aggregated entry and exit data is collected for each rail station at 15-minute intervals. Since commuting is one of the most habitual acts a metropolitan citizen performs, this data provides an excellent means of predicting ridership throughout the week.
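Because commuting is so habitual, a natural first baseline is to predict each 15-minute interval from the historical average for the same weekday and time slot. The Python below is a minimal sketch of such a baseline under that assumption; the function names and the entry counts are invented, and this is not presented as the model the students built.

```python
from collections import defaultdict
from datetime import datetime


def slot_key(timestamp: datetime):
    """Map a timestamp to (weekday, index of its 15-minute interval)."""
    return timestamp.weekday(), (timestamp.hour * 60 + timestamp.minute) // 15


def fit_baseline(observations):
    """Average the observed counts for each (weekday, 15-minute slot)."""
    totals = defaultdict(lambda: [0.0, 0])
    for timestamp, count in observations:
        bucket = totals[slot_key(timestamp)]
        bucket[0] += count
        bucket[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def predict(baseline, timestamp):
    """Return the historical average for that slot, or None if unseen."""
    return baseline.get(slot_key(timestamp))


# Invented entry counts for one station, for illustration only.
history = [
    (datetime(2015, 3, 2, 8, 0), 410),   # Monday 08:00
    (datetime(2015, 3, 9, 8, 0), 390),   # following Monday 08:00
    (datetime(2015, 3, 3, 8, 0), 450),   # Tuesday 08:00
]
baseline = fit_baseline(history)
monday_forecast = predict(baseline, datetime(2015, 3, 16, 8, 5))
print(monday_forecast)
```

A baseline like this gives a reference point against which richer models (holidays, weather, events) can be compared.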

Students: Aaron Zampaglione, Filip Piasevoli, Lyla Fadden, Micah Lanier

See also: MBTA, 2015

Boston Globe Subscriber Conversion

The typical cyber-life of a Boston Globe user starts with anonymous visits and progresses from casually visiting the site to ultimately becoming a subscriber. The Boston Globe would like to understand the idiosyncrasies and patterns of subscribers and use that knowledge to increase subscription conversion rates.

Students: Jeffrey Shen, Kai Sheng, Simon Malian