Students tackle astronomically big data challenges in Chile

March 21, 2014

Harvard students applied their CSE training to an unsolved real-world problem during Wintersession

In January, many think about heading south to escape the Boston winter. During the January 2014 winter break, six Harvard graduate students ventured all the way to Chile not for a vacation, but rather for two and a half weeks of hard work on a computationally challenging problem produced by a cutting-edge astronomical instrument.

Designed by IACS Scientific Program Director and Lecturer Pavlos Protopapas, the two-week Chile-Harvard Innovative Learning Exchange Program (or CHILE Program) gave students the opportunity to work in international teams with noisy and imperfect data sets that challenged the limits of textbook methodologies. The immersive international research project was launched with generous funding from the Harvard-Chile Innovation Initiative through the agreement of Harvard’s David Rockefeller Center for Latin American Studies with Conicyt-Chile, the Institute for Applied Computational Science at Harvard SEAS, the Millennium Institute of Astrophysics (MAS), and the Center for Mathematical Modeling (CMM) at the Universidad de Chile.

Anita Mehrotra, a master’s student in the Computational Science and Engineering (CSE) program, always had a fascination with stars, constellations, and galaxies. However, as an undergraduate she made the choice to pursue math, economics, and computer science instead of astronomy.

Anita Mehrotra and Yang Chen in Santiago

The CHILE program gave her the opportunity to learn about astronomy and to realize that as a data scientist, she had the ability “to do something so outside my domain of expertise, but still contribute. I may not be an astronomer, but I built tools that scientists at the Center for Mathematical Modeling in Chile will use in their future astronomy research.”

Harvard student participants included three CSE master’s students, two astronomy PhD students, and Yang Chen, a PhD student in statistics. In her applied statistics research, Chen uses time series data, but she had limited experience with the high-dimensional data and signal processing common in astronomy. Harvard offers a course in astrostatistics, but like many PhD students, Chen had invested her time and energy in the courses that would enable her work on her thesis project.

Chen applied to participate in CHILE because “[astronomy] is a field I never touched before, and I was so interested. Nowadays data scientists and statisticians get involved in astronomy projects. In applied statistics, you have to find different fields to work on.”

Chen may work on astronomy projects after she graduates, but whether or not she returns to astronomy, the experience of collaborating with researchers from another field was good preparation for her future work as an applied statistician. She is unreservedly glad she went on the trip. “It was one of the most unique experiences in my time in grad school,” she said.

Shedding Light on Heavenly Bodies with Data Science

The six Harvard and six Chilean students collaborated on one not-so-simple task: developing fast, cost-effective computational methods to process and analyze data from the Dark Energy Camera (DECam). The DECam is one of the instruments used in the Dark Energy Survey (DES), an international effort involving Europe, the United States, Brazil, and Chile “designed to probe the origin of the accelerating universe and help uncover the nature of dark energy by measuring the 14-billion-year history of cosmic expansion with high precision.”

The Cerro Tololo Inter-American Observatory

Chile, with dry desert skies and high plateaus that are ideal locations for astronomy, has become host to the world’s largest and most sophisticated observatories.

DES scientists are currently wrestling with the challenges of managing and analyzing the large amounts of data they receive daily from the DECam. “[Students] have enough experience from their homework with problems that have already been solved,” said Protopapas. “Part of the philosophy of the CHILE Program was to give students a chance to tackle an unsolved problem that involved real, messy data.” 

Protopapas and faculty from the Universidad de Chile and Pontificia Universidad Católica de Chile divided the students into three teams, each with Harvard and Chilean students, making sure each team included students with backgrounds in astronomy, computer science, and mathematics or statistics.

One team took raw data files from the telescope and used those images to produce light curves that revealed the varying light reaching Earth from points on the sky over time. Another wrote programs to determine what type of celestial objects created the curves (variable stars, galaxies, etc.) and also to determine the probability that their classification was correct.
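The idea of a light curve can be illustrated with a minimal Python sketch. It assumes the brightness measurements have already been extracted from the images; the object names, times, and brightness values are hypothetical, and the students' actual pipeline is not described in this article.

```python
from collections import defaultdict


def build_light_curves(detections):
    """Group (object_id, time, brightness) detections into light curves.

    Returns a dict mapping each object to a list of (time, brightness)
    pairs sorted by observation time.
    """
    curves = defaultdict(list)
    for object_id, time, brightness in detections:
        curves[object_id].append((time, brightness))
    return {obj: sorted(points) for obj, points in curves.items()}


# Hypothetical detections: (object id, time in days, brightness).
detections = [
    ("star_a", 2.0, 15.1),
    ("star_b", 1.0, 18.4),
    ("star_a", 1.0, 15.3),
    ("star_a", 3.0, 15.0),
]
light_curves = build_light_curves(detections)
```

Each entry in `light_curves` is one object's brightness over time, which is the kind of input a classifier of variable objects would then work from.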

The third group worked on two separate projects. While waiting for the first two teams to finish their work, they used the Python programming language to create a tool that searches information that astronomers have already collected and archived in databases. The module generates a file with the location of a celestial object and all currently existing information about the object. The tool provided additional data to supplement the raw telescope data and improve the accuracy of object classifications.

The group conducted a post-classification analysis to determine how much the mathematical certainty of the classifications would improve if astronomers made additional observations. They also developed a script to calculate the cost of additional observations to help scientists make cost-effective and efficient decisions.
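The trade-off between added certainty and observing cost can be sketched in a few lines of Python. This is only an illustration of the ranking step: the probabilities, costs, and function name are hypothetical, and it does not reproduce the MCMC and Gaussian process analysis the team used.

```python
def rank_follow_ups(candidates):
    """Rank objects by expected certainty gain per unit cost, best first.

    Each candidate is (name, current_probability, expected_probability,
    cost), where the probabilities describe confidence in the object's
    classification before and after additional observations.
    """
    scored = []
    for name, current_prob, expected_prob, cost in candidates:
        gain = expected_prob - current_prob
        scored.append((gain / cost, name))
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical numbers; cost is in arbitrary units of telescope time.
candidates = [
    ("obj_1", 0.60, 0.90, 3.0),
    ("obj_2", 0.85, 0.90, 1.0),
    ("obj_3", 0.50, 0.80, 2.0),
]
ranking = rank_follow_ups(candidates)
```

Ranking by gain per unit cost favors objects whose classification would improve most for the least telescope time.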

“I loved that I could leverage material I had learned from my classes like CS 281 [Advanced Machine Learning]. Our group ultimately used MCMC [Markov chain Monte Carlo methods] and Gaussian processes to conduct the post-classification analysis,” said Anita Mehrotra. This semester, Mehrotra is taking a spatial statistics class to go even deeper into those concepts.

As a group leader, Mehrotra found it challenging to “effectively use the strengths of the different team members.” Working on teams with people who had different types of training was challenging, but ultimately this diversity helped the team develop “a great product because we were coming at the problem from several different angles.”

At one point, Mehrotra thought she had a bug in her code: it wasn’t able to take the log of negative numbers. Through discussion with the astronomers in her group, she discovered that the data she was working with shouldn’t include any negative values; the problem lay in the data and not in her code.
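A short Python sketch shows how such a problem can be surfaced in the data rather than in the code. The article does not say which quantity was being logged; the flux-to-magnitude conversion and the zero point of 25.0 are assumptions chosen for illustration.

```python
import math


def flux_to_magnitude(flux, zero_point=25.0):
    """Convert a flux to a magnitude: m = zero_point - 2.5 * log10(flux).

    Returns None for non-positive flux, since the logarithm is undefined
    there and such a value signals a problem in the data.
    """
    if flux <= 0:
        return None
    return zero_point - 2.5 * math.log10(flux)


def split_valid(fluxes):
    """Separate usable fluxes from values that need a closer look."""
    valid = [f for f in fluxes if f > 0]
    suspect = [f for f in fluxes if f <= 0]
    return valid, suspect


# Hypothetical measurements, including one negative value.
valid, suspect = split_valid([100.0, -3.0, 10.0])
magnitudes = [flux_to_magnitude(f) for f in valid]
```

Setting aside the suspect values, instead of silently dropping them, keeps them available for the kind of conversation with domain experts described above.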

Chen noted that the time constraints of the CHILE program forced the group to be “more productive and to try out different ideas very quickly.”

“You have to be able to pick a simple, easy solution and push it forward, and if you have time, come back to improve it,” said Protopapas. The students had to let go of their desire to find a perfect solution. Practicing this type of problem-solving provides preparation for research in academia and industry, where time and resource constraints make the search for a perfect solution impossible.

The students completed their final results during their trip back to the US and sent their poster outlining the project to be printed just in time for presentation at the IACS data science symposium, “Weathering the Data Storm.”

Protopapas was “quite impressed by the students’ training and preparation, given that most of them were in their first or second year of graduate study. The courses here prepared them well to approach a real-world problem. They knew what to do, and they were ready to go.”

Trip to the Cerro Tololo Inter-American Observatory

Team CHILE posing for a group photo in front of the observatory

After a week in Santiago working together on the problem, the students traveled to the Cerro Tololo Inter-American Observatory (CTIO) in the Atacama Desert in northern Chile to see the telescopes and the DECam that produced their data. Meeting with the scientists who collected the data they were working on was a highlight of the trip. Dr. Christopher Smith, AURA Director in Chile, gave the students and faculty a demonstration of the telescope’s automatic laser calibration system.

Learning how the Dark Energy Camera worked helped the students understand the source of the noise in the data. The DECam data contained many cosmic ray streaks, and at the observatory Chen discovered that her initial instincts about how to filter out this noise were incorrect. “If you don’t talk to the scientists, you just use your own way and you might miss certain results,” she said. “Knowing the whole process of data collection helps you build certain parts of the model. The assumptions used in the model are crucial because they influence the conclusion. Without field knowledge, we cannot do good data analysis.”

Smith also discussed the challenges astronomers experience in their attempts to collect their data. The students learned that even in the desert, astronomers must sometimes wait a long time to schedule their observations. These conversations made the exercise of wrestling with messy data take on increased importance. As Chen said, “We cannot fully be dependent on designing our own way of observing certain things. We have to build models that are able to combine data sets that are not exactly what you want but are relevant and extract certain information from them.”

For Chen, visiting the observatory was the most exciting part of the trip. Even with the naked eye on a cloudy night, the view of the stars in the high-altitude desert was stunning. Next year, the CHILE Program hopes to arrange for the students to conduct observations alongside the scientists.

Chilean Immersion

Students exploring colorful Valparaiso

Despite their grueling work schedule, the students managed to find time to explore. The DRCLAS Regional Office in Santiago provided invaluable assistance with visitor arrangements, presented the Harvard students with an orientation to Chile, and served as a resource for the Harvard students throughout their stay. The DRCLAS Office also organized a day-trip to Valparaiso, including a seaside dinner with Harvard alumni living in Santiago.

The immersion in Chilean culture helped some students strengthen their Spanish skills. Long days working together coupled with coffee and ice cream breaks and late-night outings to sample Chile’s famous pisco sours brought the students together quickly. The multinational group included Chilean students, Chinese students, US students, and one student from Bolivia. Harvard students learned from their group mates about the day-to-day life of Chilean and Bolivian people, the history of Chile, and even some Chilean slang. The Harvard students were touched by the warm welcome they received from their Chilean hosts, and they are still in touch with the Chilean students, some of whom plan to visit Harvard this summer.

Strengthening Harvard-Chile Research Collaboration

Another goal of the program was to strengthen research collaboration between Chilean universities and Harvard. Harvard SEAS and the University of Chile’s Center for Mathematical Modeling signed a Memorandum of Understanding in the fall, committing to joint research activities and the exchange of faculty, graduates, and students. Although Chilean faculty and students had visited IACS, the CHILE Project enabled Harvard engineering students to travel to the University of Chile for the first time.

During the trip, Protopapas divided his time between mentoring the students, attending research meetings, and presenting a talk to the Bioinformatics Group on High Performance and Large Data at UChile.

He also met new collaborator Mario Hamuy, UChile Professor of Astronomy, who invited Protopapas to participate in the Chilean government-funded Millennium Institute of Astrophysics (MAS). The next round of activities includes the Second La Serena School for Data Science (August 15-22), the Astroinformatics 2014 conference (August 25-29), and IBERAMIA 2014 (November 24-27), a machine learning conference in Chile. Protopapas is applying for support to enable Chilean student researchers to travel to Harvard and to continue jointly training the next generation of data scientists.

To learn more about the students’ research, view team screencasts and code on the CHILE Project web page: