Elementary school classroom assignment puzzles and Indian matrimonial website data inspire student projects
by Anita Mehrotra
The Computational Science and Engineering program may have earned a reputation in SEAS and beyond as being highly intense, but Master’s students in IACS are already applying the skills they’ve learned in the classroom to real-world projects and creative independent research opportunities.
Data Science as a Tool for Sociological Exploration
In his fall independent research project, Master’s candidate Nikhil Sud (S.M. ‘14) aggregated and explored data from online Indian dating sites like SimplyMarry.com, hoping to gain insight into sociological biases in India and within the Indian diaspora. Working under the supervision of IACS Scientific Program Director and Lecturer Pavlos Protopapas, Sud found correlations among skin tone, income and economic class.
“When I proposed this project, I anticipated finding out more about caste and skin-tone preferences, since Indian matrimonial sites are infamous for bringing out these prejudices,” said Sud. “However, as I began to examine the data, trends related to weight, gender-roles, location, income and religion came up. For instance, one of the fields asks users whether they drink. Users who select ‘occasionally’ have the highest average income, followed by users who select ‘yes’ and then finally users who select ‘no’!”
Sud leveraged Python’s extensive statistical and visualization modules, then applied a combination of location-based, text, and clustering analyses to move beyond summary statistics. Outcomes of the project included an ability to predict the gender of a profile, because “when men and women describe themselves they mostly stick to predefined gender roles. Men use words like ‘football’, ‘cars’ and ‘bikes,’ and women use words like ‘dancing’, ‘cooking’, and ‘painting’.”
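The article does not say which model Sud used for gender prediction. As one plausible illustration, a word-count (naive Bayes) classifier can pick up the kind of vocabulary differences he describes. The sketch below is not his implementation: the training sentences are hypothetical toy data built from the words quoted above, and the function names `train` and `predict` are invented for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy profiles (not from Sud's data set), using the quoted words.
TRAIN = [
    ("i love football and cars", "male"),
    ("bikes and football on weekends", "male"),
    ("i enjoy dancing and cooking", "female"),
    ("painting and dancing are my hobbies", "female"),
]

def train(examples):
    """Count words per label and profiles per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("cars and bikes", model))        # male
print(predict("cooking and painting", model))  # female
```

A classifier like this only reflects the word patterns in its training profiles, which is why it can double as a measurement of how strongly self-descriptions follow gender roles.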
Sud was able to extend the predictions to authorship, i.e. identifying who actually created the online profiles. Parents, for example, often focused on educational and career achievements, whereas children focused on interests and personality. K-means, density-based (DBSCAN), and expectation-maximization clustering suggested a marriage market that treated people as if they stood on a ladder: those with more-valued attributes were placed higher up than those with “worse,” less-valued attributes. Almost every attribute explored could be placed on a good/bad scale, with higher income corresponding to fairer self-reported skin tones and more specific partner preferences, such as men seeking women of lower weight.
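For readers unfamiliar with the first of those clustering methods, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration only: the two-feature points are hypothetical values scaled to the range 0 to 1, the starting centers are chosen by hand, and a real analysis would more likely rely on an established library.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm; `centroids` are caller-supplied starting centers."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # converged: centers stopped moving
        centroids = new_centroids
    return centroids, assignments

# Hypothetical profiles described by two scaled numeric attributes.
points = [(0.1, 0.2), (0.15, 0.1), (0.2, 0.25),
          (0.8, 0.9), (0.85, 0.8), (0.9, 0.95)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```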
“My favorite part of the project was discovering new insights that weren’t anticipated,” concluded Sud. “Working on an independent project was great because it allowed me to creatively apply data science to a data set that was very different from anything I had worked with before.”
Parallelization in the (Elementary School) Classroom
Master’s candidates Daniel Newman (S.M. ‘14) and Isaac Slavitt (S.M. ‘14) used their final project from CS 205, Computing Foundations for Computational Science, to efficiently group nearly one hundred elementary school students into four classes. Newman and Slavitt framed the placement problem as an integer linear program (ILP) and solved it using parallel computing.
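The article does not reproduce the students' model, but a generic class-placement ILP gives a sense of what such a formulation looks like. In the sketch below every symbol is an assumption made for illustration: S is the set of students, C the set of classes (four in this case), x_{sc} equals 1 when student s is placed in class c, w_{sc} is a hypothetical penalty for that placement, U is a class-size cap, and A is a set of student pairs the school wants kept apart.

```latex
\begin{align*}
\min_{x} \quad & \sum_{s \in S} \sum_{c \in C} w_{sc}\, x_{sc} \\
\text{subject to} \quad & \sum_{c \in C} x_{sc} = 1 && \forall s \in S \\
& \sum_{s \in S} x_{sc} \le U && \forall c \in C \\
& x_{sc} + x_{tc} \le 1 && \forall (s,t) \in A,\ \forall c \in C \\
& x_{sc} \in \{0, 1\} && \forall s \in S,\ \forall c \in C
\end{align*}
```

The first constraint places every student in exactly one class, the second caps class size, and the third prevents both members of a flagged pair from landing in the same class.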
Many elementary schools struggle with the amount of time spent placing students into classes. “The school that I worked with would spend three to four weeks every summer just grouping students based on a set of constraints,” shared Newman.
An algorithm Newman co-designed in 2012 did a good job of placing students into classes based on those constraints, but took nearly four hours to run. Together, Slavitt and Newman created a framework that used a combination of Python and OpenMPI (an implementation of the Message Passing Interface standard) to run the algorithm on 256 cores in only four minutes – a sixty-fold improvement over the initial algorithm.
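The sixty-fold figure follows directly from the two run times. The short calculation below reproduces it and, under the assumption (not stated in the article) that the original four-hour run used a single core, also estimates parallel efficiency, meaning the speedup divided by the number of cores.

```python
def speedup(baseline_minutes, new_minutes):
    """How many times faster the new run is than the baseline."""
    return baseline_minutes / new_minutes

def parallel_efficiency(speedup_factor, cores):
    """Fraction of ideal linear scaling; assumes a single-core baseline."""
    return speedup_factor / cores

s = speedup(4 * 60, 4)           # four hours versus four minutes
e = parallel_efficiency(s, 256)  # 256 cores, per the article
print(f"speedup: {s:.0f}x, efficiency: {e:.1%}")  # speedup: 60x, efficiency: 23.4%
```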
“To me, the best part of the project was coming up with an algorithm that clearly gave a better practical solution to the school's problem. The old algorithm that preceded it… only gave one solution. Sometimes, the school looks at a solution that is good mathematically, and says ‘this isn’t going to work, it has a problem that we forget to tell you to include in your model,’” said Newman. The school would then be faced with re-running the entire algorithm, or re-allocating students in an inefficient, by-hand method. “Our [parallelized] algorithm, on the other hand, returns several different but viable solutions. As a result, if a school doesn’t like any one solution, it is easy to offer a plethora of alternatives, because you can leverage the 4-core processor on a desktop and still get considerably significant speed-ups.”
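One simple way to obtain several viable placements, offered here only as a sketch and not as the students' actual algorithm, is to run many independent randomized attempts and keep every distinct result that satisfies the constraints. Because the attempts do not depend on one another, they could be spread across cores or MPI ranks. The student IDs, the keep-apart pairs, and the function names are all hypothetical.

```python
import random

STUDENTS = list(range(8))      # hypothetical student IDs
NUM_CLASSES = 2                # small toy problem, not the school's four classes
KEEP_APART = [(0, 1), (2, 3)]  # hypothetical pairs that must be separated

def solve(seed):
    """One randomized attempt: shuffle students, then deal them round-robin."""
    rng = random.Random(seed)
    order = STUDENTS[:]
    rng.shuffle(order)
    placement = {s: i % NUM_CLASSES for i, s in enumerate(order)}
    violations = sum(placement[a] == placement[b] for a, b in KEEP_APART)
    return violations, tuple(placement[s] for s in STUDENTS)

def viable_solutions(seeds):
    """Collect the distinct placements that violate no constraint."""
    # Each call to solve() is independent, so the built-in map used here
    # could be swapped for a process pool's map to run attempts in parallel.
    results = map(solve, seeds)
    return sorted({placement for violations, placement in results if violations == 0})

solutions = viable_solutions(range(200))
print(len(solutions), "distinct viable placements found")
print(solutions[0])
```

If the school rejects one placement for a reason the model did not capture, the next one on the list is already available without re-running anything.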
Isaac Slavitt (center) and Daniel Newman (right) discuss their work with an audience member from the 2014 INFORMS Analytics Conference. They used parallel computing to speed up an algorithm for grouping students into the best class groups.
On March 30, Newman and Slavitt presented their work as a poster at the prestigious 2014 INFORMS Analytics Conference in Boston, MA. In addition to sharing their work with the wider computing audience, Newman and Slavitt gained useful feedback on next steps, such as including slack variables to indicate the infeasibility of a solution and creating a graphical interface that lets non-experts interact easily with the solution.
“It's gratifying to see a computationally intensive task go from taking something like four hours to two minutes,” Slavitt commented.
More information about the projects featured in this article can be found here:
Nikhil Sud, Indian Online Matrimony Data Exploration: http://projects.iq.harvard.edu/matrimony_data_exploration
Daniel Newman and Isaac Slavitt, Using Parallel Computing and ILP to Place Students in Elementary School Classes: http://isms.github.io/cs205-fall13-project/