April 20, 2018

Interactive Visual Discovery in Event Analytics: Electronic Health Records and Other Applications

Speaker: Ben Shneiderman, Professor of Computer Science, University of Maryland--College Park

Event Analytics is rapidly emerging as a new topic to extract insights from the growing set of temporal event sequences that come from medical histories, e-commerce patterns, social media log analysis, cybersecurity threats, sensor nets, online education, sports, etc. This talk reviews a decade of research on visualizing and exploring temporal event sequences to view compact summaries of thousands of patient histories represented as time-stamped events, such as strokes, vaccinations, or admission to an emergency room.

Dr. Shneiderman’s work on EventFlow supports point events, such as heart attacks or vaccinations, and interval events, such as medication episodes or long hospitalizations. Demonstrations cover visual interfaces to support hospital quality control analysts, who ensure that required procedures were carried out, and clinical researchers, who study treatment patterns that lead to successful outcomes. He will show how domain-specific knowledge and problem-specific insights can sharpen the analytic focus, enabling more successful pattern and anomaly detection.
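
As a rough illustration of the distinction between point and interval events, the sketch below models both with one record type and collapses several patient histories into counts of identical event-category sequences. It is a minimal sketch only, not EventFlow's data model or aggregation algorithm; the `Event` class, the `summarize_sequences` function, and the sample records are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One time-stamped event in a patient history (hypothetical schema)."""
    patient_id: str
    category: str
    start_day: int                  # days since the start of the record
    end_day: Optional[int] = None   # None marks a point event

    @property
    def is_interval(self) -> bool:
        return self.end_day is not None


def summarize_sequences(events):
    """Count how many patients share each ordered sequence of categories."""
    histories = {}
    # Sort by patient, then by start time, so each history is in temporal order.
    for event in sorted(events, key=lambda e: (e.patient_id, e.start_day)):
        histories.setdefault(event.patient_id, []).append(event.category)
    return Counter(tuple(seq) for seq in histories.values())


# Made-up records for illustration only.
EVENTS = [
    Event("p1", "ER admission", 0),
    Event("p1", "Medication", 1, 30),   # interval event: days 1 to 30
    Event("p2", "ER admission", 5),
    Event("p2", "Medication", 6, 20),
    Event("p3", "Vaccination", 2),
]

summary = summarize_sequences(EVENTS)
for sequence, count in summary.most_common():
    print(count, " -> ".join(sequence))
```

Grouping identical sequences is what turns many individual histories into a compact summary: here two of the three made-up patients share the same "ER admission, then Medication" pattern.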

Co-sponsored with the Harvard Data Science Initiative (HDSI).

April 6, 2018

Dean's Lecture on Computational Science & Engineering (CSE)

Taking the Universe's Baby Picture

Speaker: David Spergel, Charles Young Professor of Astronomy, Princeton University & Founding Director, CCA, Flatiron Institute

Images of the cosmic microwave background, the leftover heat from the big bang, are the universe’s baby picture. Embedded in this picture is information about the universe’s age, origin, composition and fate. Our observations have revealed a remarkably simple, yet strange universe. A simple model with only five basic numbers can describe the basic statistical properties of the universe: the positions and properties of billions of galaxies and millions of independent points on the sky. While the model is simple, it implies that atoms make up only 5% of the universe and that the bulk of the universe is composed of mysterious dark matter and dark energy. Dr. Spergel will review past measurements and look forward to future observations that could determine the properties of the dark energy and deepen our understanding of the universe’s beginnings and ultimate fate.

March 23, 2018

Data Science and Our Environment

Speaker: Francesca Dominici, Professor of Biostatistics, HSPH & Co-Director of the Harvard Data Science Initiative (HDSI)

What if I told you I had evidence of a serious threat to American national security – a terrorist attack in which a jumbo jet will be hijacked and crashed every 12 days? Thousands will continue to die unless we act now. This is the question before us today – but the threat doesn’t come from terrorists. The threat comes from climate change and air pollution.

Researchers have developed an artificial neural network model that uses on-the-ground air-monitoring data and satellite-based measurements to estimate daily pollution levels across the continental U.S., breaking the country up into 1-square-kilometer zones. They have paired that information with health data contained in Medicare claims records from the last 12 years, covering 97% of the population aged 65 or older. They have also developed statistical methods and computationally efficient algorithms for the analysis of over 460 million health records.

Their research shows that short- and long-term exposure to air pollution is killing thousands of senior citizens each year. Their data science platform is telling us that federal limits on the nation’s most widespread air pollutants are not stringent enough.

This type of data is the sign of a new era for the role of data science in public health, and also for the associated methodological challenges. For example, with enormous amounts of data, the threat of unmeasured confounding bias is amplified, and causality is even harder to assess with observational studies. Dr. Dominici will discuss these and other challenges.

March 2, 2018

Data Science Toward Understanding Human Learning and Improving Educational Practice

Speaker: Ken Koedinger, Professor of Human–Computer Interaction and Psychology, Carnegie Mellon University

Big data and machine learning appear to be revolutionizing many fields. Is education one of them? Unlike our universe or the quantum structure of particles, how people learn is a question that seems much closer to our direct observation. So close, one might wonder why data is needed and whether self-reflection is sufficient to understand learning. Koedinger’s first goal is to convince you that self-reflection is not sufficient. His second is to provide you with examples of educational data mining and how it has provided insights into how people learn (e.g., slowly and incrementally) and fostered improvements in human learning outcomes (e.g., 2x more effective learning). Koedinger will emphasize that explanatory models of data are critical for such insights and outcomes and that disciplinary expertise, not just data science, must be brought to bear. He will illustrate the role of disciplinary expertise in the psychology of learning and in the educational subject-matter domain, and the role of explanatory models in the form of symbolic computational models of learning that can be taught competencies like algebra, grammar, and chemistry.

February 16, 2018

Challenges and Considerations in Search Quality

Speaker: Isabelle Stanton, Software Engineer, Google

Google Search is one of the most widely used data products in the world. While it has been in constant development since 1997, Search is by no means a solved problem. Even the question of what constitutes a good set of search results for a query has a constantly evolving answer. Dr. Stanton’s talk will focus on some of the challenges Search faces: defining quality metrics, dealing with noisy and sparse data, and making design choices for a data system at this scale, as well as what to do when you just don't like the output of a system.

January 26, 2018

Geometric Deep Learning on Graphs and Manifolds: Going Beyond Euclidean Data

Speaker: Michael Bronstein, Radcliffe Institute, Università della Svizzera italiana (Switzerland), and Tel Aviv University (Israel)

In the past decade, deep learning methods have achieved unprecedented performance on a broad range of problems in various fields from computer vision to speech recognition. So far research has mainly focused on developing deep learning methods for Euclidean-structured data. However, many important applications have to deal with non-Euclidean structured data, such as graphs and manifolds. Such data are becoming increasingly important in computer graphics and 3D vision, sensor networks, drug design, biomedicine, high energy physics, recommendation systems, and web applications. The adoption of deep learning in these fields has been lagging behind until recently, primarily since the non-Euclidean nature of objects dealt with makes the very definition of basic operations used in deep networks rather elusive. In this talk, Dr. Bronstein will introduce the emerging field of geometric deep learning on graphs and manifolds, overview existing solutions and applications, and outline the key difficulties and future research directions.

November 17, 2017

Using Knockoffs to Find Important Variables with Statistical Guarantees

Speaker: Lucas Janson, Assistant Professor in Statistics, Harvard University

Many contemporary large-scale applications, from genomics to advertising, involve linking a response of interest to a large set of potential explanatory variables in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively select important variables while controlling the fraction of false discoveries, even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, Dr. Janson and his colleagues propose a new framework of model-X knockoffs, which reads from a different perspective the knockoff procedure (Barber and Candès, 2015) originally designed for controlling the false discovery rate in linear models. Model-X knockoffs can deal with arbitrary (and unknown) conditional models and any dimensions, including when the number of explanatory variables p exceeds the sample size n. Their approach requires the design matrix be random (independent and identically distributed rows) with a known distribution for the explanatory variables, although they show preliminary evidence that their procedure is robust to unknown/estimated distributions. As they require no knowledge/assumptions about the conditional distribution of the response, they effectively shift the burden of knowledge from the response to the explanatory variables, in contrast to the canonical model-based approach which assumes a parametric model for the response but very little about the explanatory variables. To their knowledge, no other procedure solves the controlled variable selection problem in such generality, but in the restricted settings where competitors exist, they demonstrate the superior power of knockoffs through simulations. They have applied their procedure to data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
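
The final selection step of a knockoff procedure can be sketched in a few lines. The sketch below assumes that a statistic W_j has already been computed for each variable from the original variables and their knockoffs, with large positive values indicating evidence that the variable matters; it then applies the "knockoff+" threshold of Barber and Candès (2015) at a target false discovery rate q. Constructing the knockoff variables themselves, which depends on the distribution of the explanatory variables, is not shown. The function names and the statistics in `W` are made up for illustration.

```python
import math


def knockoff_plus_threshold(w, q):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    # Only the nonzero magnitudes of the statistics need to be tried.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return math.inf  # no threshold meets the target, so nothing is selected


def select_variables(w, q):
    """Indices of variables whose statistic reaches the threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


# Hypothetical statistics for eight variables.
W = [5.0, 4.0, 3.0, -1.0, 2.0, 0.5, -0.3, 6.0]
selected = select_variables(W, q=0.25)
print(selected)
```

With these made-up statistics and q = 0.25, the threshold is 2.0, so variables 0, 1, 2, 4, and 7 are selected. The count of large negative statistics serves as an estimate of the number of false positives among the large positive ones, which is what lets the threshold control the false discovery rate.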

This is joint work with Emmanuel Candès at Stanford and Yingying Fan and Jinchi Lv at USC.

November 10, 2017

Extreme Scale Computing, Big Data Science and Web of Life Network Science

Speaker: Manju Manjunathaiah, Lecturer on Computation, Harvard University

The first part of Professor Manjunathaiah’s talk will explore two leading formal models of concurrency in computer science, the polyhedral model and CSP, as a distinct approach to extreme scale computing. In the second part, he will present three grand challenge areas as exemplars of extreme scale big data science: environmental science (climate modelling), genomics life science (tree of life) and computational neuroscience (deep learning). Here the underlying scaling characteristics are energy-efficiency, resilience and predictive capability. Prof. Manjunathaiah will highlight some current research which explores distance minimisation, self-organising and asynchronous data flow computational principles for extreme scale data science.

The final part of the talk presents curiosity-driven exploratory research under the premise that “theoretical computer science meets biological phenomena”: radical advancements in deep measurements of all life on this planet are bringing two grand biological phenomena into the realm of computer science and, combined with deep computations at extreme scales, offer new avenues for big data science through productive cross-collaboration between the sciences. Professor Manjunathaiah will highlight some computational principles for investigating the grand goal of modelling the continuous dynamics of the “web of life” ecosystem, accounting for the “information domain” (network dynamics) of biological phenomena along with matter and energy. Networks permeate all scales of life, from genes to the web of life.

Dr. Manjunathaiah's presentation slides can be found here.

November 3, 2017

Adventures in Analytics

Speaker: Bob Rogers, Chief Data Scientist for Analytics and AI, Intel

The world is an amazing place for data scientists. Bob Rogers, Chief Data Scientist for Analytics and AI at Intel Corporation, will describe his experiences as a leader in analytics and AI. He will share his perspective on what makes a great data scientist, how he defines data science, analytics and Artificial Intelligence, real insights into the day-to-day life of a data scientist, and an overview of the model creation pipeline. Bob believes that the opportunities to apply advanced analytics to improve the lives of people are boundless, and will demonstrate his work with the “Intel Inside, Safer Children Outside” program, which applies analytics and AI to fight child sex trafficking and child exploitation online.

October 27, 2017

Reinforcement Learning for Healthcare

Speaker: Finale Doshi-Velez, Assistant Professor of Computer Science, Harvard University

Many healthcare problems require thinking not only about the immediate effect of a treatment, but possible long-term ramifications. For example, a certain drug cocktail may cause an immediate drop in viral load in HIV, but also cause the presence of resistance mutations that will reduce the number of viable treatment options in the future. Within machine learning, the reinforcement learning framework is designed to think about decision-making under uncertainty when decisions may have long-lasting effects. However, translating these formalisms to real settings with messy, partial data creates many challenges. Prof. Doshi-Velez will discuss innovations in her research group to apply these paradigms to real problems in healthcare: treating patients with sepsis and managing patients with HIV.
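
To make the trade-off between immediate and long-term effects concrete, here is a minimal sketch of value iteration on a two-state toy problem. The states, actions, rewards, and discount factor are invented for illustration and are not drawn from the sepsis or HIV work described above. The sketch also assumes a fully known, deterministic model, which is exactly what messy, partial clinical data does not provide.

```python
GAMMA = 0.9  # discount factor (hypothetical)

# TRANSITIONS[state][action] = (next_state, immediate_reward); toy values only.
TRANSITIONS = {
    "many_options": {
        "aggressive": ("few_options", 1.0),     # big short-term gain, loses options
        "conservative": ("many_options", 0.6),  # smaller gain, keeps options open
    },
    "few_options": {
        "only_option": ("few_options", 0.2),
    },
}


def value_iteration(transitions, gamma, tol=1e-9):
    """Return long-run state values and the policy that is greedy in them."""
    values = {state: 0.0 for state in transitions}
    while True:
        new_values = {
            state: max(r + gamma * values[nxt] for nxt, r in actions.values())
            for state, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    policy = {
        state: max(actions, key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
        for state, actions in transitions.items()
    }
    return values, policy


values, policy = value_iteration(TRANSITIONS, GAMMA)

# A myopic rule looks only at the immediate reward.
start_actions = TRANSITIONS["many_options"]
myopic_choice = max(start_actions, key=lambda a: start_actions[a][1])

print("myopic choice:", myopic_choice)
print("long-horizon choice:", policy["many_options"])
```

In this toy setting the myopic rule picks the aggressive action for its larger immediate reward, while the long-horizon policy prefers the conservative action because it preserves future options.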

October 20, 2017

Theory Methods to Describe Transport and Dynamics in Quantum Materials

Speaker: Prineha Narang, Assistant Professor of Computational Materials Science

There is consensus in the field that in the post-Moore’s law era of electronics, there is a critical need to understand ultrafast dynamics of materials and non-equilibrium transport, and to discover new quantum-engineered materials, in order to design devices of the future. In this context Dr. Narang will share her research group’s recent computational work in two interconnected areas: quantum materials-by-design, including electron-electron and electron-phonon calculations in van der Waals heterostructures, and a new far-from-equilibrium transport method, applied to faceted nanostructures. Narang will also present some ideas her group is working on in defects as engineered quantum emitters to surpass the vacancy centers in diamond.

September 15, 2017

Machine Learning for Small Business Lending

Speaker: Thomson Nguyen, Head of Data Science, Square Capital

In the era of Electronic Health Records, it is possible to examine the outcomes of decisions made by doctors during clinical practice to identify patterns of care—generating evidence from the collective experience of patients. We will discuss methods that transform unstructured EHR data into a de-identified, temporally ordered, patient-feature matrix.  We will also review use-cases, which use the resulting de-identified data, for pharmacovigilance, to reposition drugs, build predictive models, and drive comparative effectiveness studies in a learning health system.

September 8, 2017

Big Data Software: What's Next?

Speaker: Mike Franklin, Chairman, Department of Computer Science, University of Chicago

Starting a business is hard: at least 65% of small businesses in the United States fail in their first five years of operation. Among the biggest reasons cited for business failure is a lack of working capital to get started or to scale. In this talk, Nguyen will share his team's current work in machine learning on small business loan eligibility as it pertains to credit default risk mitigation, as well as challenges and opportunities in the lending space with some of the more esoteric ML approaches (e.g., why a deep learning black box isn't going to cut it).