April 25, 2014

Host-based Online Behavioral Malware Detection and Classification

Speaker: Spiros Mancoridis, Professor of Computer Science & Senior Associate Dean of Computing, Drexel University

The complex computing systems employed by governments, corporations, and other institutions are frequently targeted by cyber-attacks designed for espionage and sabotage. The malicious software used in such attacks is typically custom-designed or obfuscated to avoid detection by traditional antivirus software. Our goal is to create a malware detection and classification system that can quickly and accurately detect and classify such malware. We pose the problem of malware detection as a multi-channel change-point detection problem, wherein the goal is to identify the point in time when a system changes from a known clean state to an infected state.

In this talk, I will present a host-based malware detection system designed to run at the hypervisor level, monitoring hypervisor and guest operating system sensors and sequentially determining whether the host is infected. I will also describe an automatic classification system that can be trained to accurately identify new variants within known malware families, using observed similarities in behavioral features extracted from sensors monitoring live computer hosts. A case study wherein the detection system is used to detect various types of malware on an active web server under heavy computational load will be presented.
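
As a rough illustration of the change-point framing only (this is not the speaker's system), a one-sided CUSUM detector on a single synthetic sensor channel might look like the following Python sketch. The function name, the drift and threshold parameters, and the sample data are all assumptions made for illustration; a multi-channel detector would need to combine several such statistics.

```python
def cusum_change_point(samples, baseline_mean, drift=0.5, threshold=5.0):
    """Return the index where a one-sided CUSUM statistic first exceeds
    the threshold, or None if it never does.

    All parameters are hypothetical tuning knobs, not values from the talk.
    """
    stat = 0.0
    for i, x in enumerate(samples):
        # Accumulate evidence of an upward shift; clamp at zero so that
        # ordinary noise in the clean state does not build up.
        stat = max(0.0, stat + (x - baseline_mean - drift))
        if stat > threshold:
            return i
    return None


# Synthetic channel: ten "clean" readings near 0, then ten shifted readings.
clean = [0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0]
infected = [2.0] * 10
alarm_index = cusum_change_point(clean + infected, baseline_mean=0.0)
```

With these made-up numbers the alarm fires a few samples after the shift begins at index 10, which reflects the usual trade-off between detection delay and false alarms.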


April 11, 2014

Expressing yourself in R

Speaker: Hadley Wickham, Chief Scientist of RStudio; Assistant Professor of Statistics, Rice University

There are three main time sinks in any data analysis:
1. Figuring out what you want to do.
2. Turning a vague goal into a precise set of tasks (i.e. programming).
3. Actually crunching the numbers.

A well-designed domain specific language (or DSL) tightly coupled to the problem domain can make all three pieces faster. In this talk, I’ll discuss two DSLs built in R: ggvis for visualisation and dplyr for data manipulation. These build on my previous packages ggplot2 and plyr, improving both expressivity and speed.

Data visualisation and manipulation are key parts of data analysis. ggvis makes it easy to declaratively describe interactive web graphics. It combines a declarative syntax based on ggplot2 with shiny’s reactive programming model and vega’s declarative JS rendering system. dplyr implements the most important verbs of data manipulation in a datastore-agnostic fashion, so you can think about and compute with your data in the same way regardless of whether you’re working with a local in-memory data frame or a remote on-disk database.


April 4, 2014

Information Diffusion Through Adaptive Seeding

Speaker: Yaron Singer, Assistant Professor of Computer Science, Harvard SEAS

In recent years social networking platforms have developed into extraordinary channels for spreading and consuming information. Along with the rise of such infrastructure, there is continuous progress on techniques for spreading information effectively through influential users.

In this talk we will introduce a new paradigm for optimizing information diffusion processes, called Adaptive Seeding. The framework is designed to leverage a remarkable phenomenon in social networks, related to the "friendship paradox" (or "your friends have more friends than you"). We will discuss this structural phenomenon and present key algorithmic ideas and fundamental challenges of Adaptive Seeding.
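
The friendship paradox mentioned above can be checked on a toy graph. The Python sketch below is only an illustration of the phenomenon, not of the Adaptive Seeding algorithm; the function name and the star-shaped example network are assumptions.

```python
def friendship_paradox(adjacency):
    """Return (mean degree, mean over nodes of their friends' average degree)
    for an undirected graph given as {node: [neighbors]}."""
    degrees = {node: len(friends) for node, friends in adjacency.items()}
    mean_degree = sum(degrees.values()) / len(degrees)
    # For each node with at least one friend, average the friends' degrees.
    friend_means = [
        sum(degrees[f] for f in friends) / len(friends)
        for friends in adjacency.values()
        if friends
    ]
    mean_friend_degree = sum(friend_means) / len(friend_means)
    return mean_degree, mean_friend_degree


# Hypothetical star network: one hub connected to four otherwise isolated users.
star = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub"],
}
mean_degree, mean_friend_degree = friendship_paradox(star)
```

In this example the average user has 1.6 friends, while the average user's friends have 3.4 friends on average, so a randomly chosen user's neighbors tend to be better connected than the user.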


March 28, 2014

Effective Crowd-Sourcing

Speaker: Devavrat Shah, Associate Professor in the Department of Electrical Engineering and Computer Science, MIT

Crowd-sourcing systems provide a means to harness human ability at a large scale to solve a variety of problems effectively. Examples abound, from classical surveys for collecting the opinions of a group to the modern setting of social recommendations. In this talk, we shall discuss effective ways to design crowd-sourcing experiments as well as aggregate the information collected. In the context of the Mechanical Turk framework, this leads to an automated approach for getting a task done at the minimum possible cost. Time permitting, different variations of the theme will be discussed. This is based on joint work with D. Karger (MIT) and S. Oh (UIUC).
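
To make the aggregation step concrete, the simplest baseline is a per-task majority vote over redundant worker answers. The Python sketch below shows only that baseline, under the assumption of binary labels and an odd number of answers per task; it is not the method from the talk, and the task names and answers are invented.

```python
from collections import Counter


def majority_vote(labels_by_task):
    """Aggregate worker answers by picking the most common label per task.

    labels_by_task maps a task id to the list of labels workers gave it.
    """
    return {
        task: Counter(labels).most_common(1)[0][0]
        for task, labels in labels_by_task.items()
    }


# Hypothetical answers: each task was assigned to three workers.
answers = {
    "task-1": [+1, +1, -1],
    "task-2": [-1, -1, +1],
}
estimates = majority_vote(answers)
```

Majority voting treats every worker as equally reliable; more refined schemes weight workers by estimated reliability, which is where the cost savings discussed in the abstract would come from.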


March 14, 2014

Too Many Data and Too Few Parameters: Mapping and Analysing the Whole Universe, A Challenge in Computational Science

Speaker: Raul Jimenez, ICREA Chair in Cosmology and Theoretical Physics, University of Barcelona

There is only one sky to observe and thus, eventually, all of it will be stored in our computers. This will be achieved in the next few decades. However, even if finite, there is too much information in the sky; the challenge is how we mine that data set to learn the fundamental laws of nature. In my talk, I will describe this challenge and explain the need to develop more sophisticated statistical and computational tools to achieve the goal of extracting all the information from our Universe.


March 7, 2014

Quantifying Collective States from Online Social Networking Data

Speaker: Johan Bollen, Associate Professor at the Indiana University School of Informatics and Computing

A significant fraction of humanity is now engaged in online social networking. Hundreds of millions of individuals publicly volunteer information on the most minute details of their experiences, thoughts, opinions, and feelings in real-time. This information is communicated largely in the form of very short texts, necessitating the development of text analysis algorithms that can determine psychological states from the content of sporadic and poorly formatted text updates. In this presentation I will provide an overview of the rapidly evolving domain of computational social science, which along with web science, is making significant contributions to our understanding of collective decision-making and social psychology.  In our research we have developed tools to determine the collective states of online communities from large-scale social media data and have related these measurements to a variety of socio-economic indicators. We have shown how fluctuations of collective mood states match trends in the financial markets, used longitudinal data on the fluctuations of individual mood states to study mood homophily in social networks, and investigated how measures of online attention may yield new indicators of scholarly communication.


February 28, 2014

Soon Everyone Will Be a "Data Scientist" or "Data Explorer." What Data Systems Will They Be Using?

Speaker: Stratos Idreos, Assistant Professor of Computer Science, Harvard SEAS

How far away are we from a future where a data management system sits in the critical path of everything we do? Already today we need to go through a data system in order to do several basic tasks, e.g., to pay at the grocery store, to book a flight, to find out where our friends are and even to get coffee. Businesses and sciences are increasingly recognizing the value of storing and analyzing vast amounts of data. Other than the expected path towards an exploding number of data-driven businesses and scientific scenarios in the next few years, in this talk we also envision a future where data becomes readily available and its power can be harnessed by everyone. What both scenarios have in common is a need for new kinds of data systems which are tailored for data exploration, which are easy to use, and which can quickly absorb and adjust to new data and access patterns on-the-fly. In this talk, we will discuss this vision as well as recent and ongoing advances towards systems which are tailored for data exploration.


February 14, 2014

Dean's Lecture on Computational Science
Fast, Accurate Tools for Physical and Biophysical Modeling

Speaker: Leslie Greengard, Simons Foundation and Courant Institute, New York University

During the last few years, new algorithms have brought large-scale physical and biophysical modeling within practical reach. The fast multipole method, for example, permits the rapid evaluation of all pairwise interactions in systems of charged particles.  If N denotes the number of particles in the system, the cost grows linearly with N rather than quadratically, with million-fold speedups obtained in systems with one billion charges. This is critical, for example, in simulations of electromagnetic phenomena, molecular dynamics, ion channels, and protein-protein interactions. I will give an overview of these methods and their applications, followed by a brief and somewhat speculative discussion of current opportunities at the intersection of applied mathematics, physics, and biology in microscopy, biomedical imaging, systems biology and biophysics.
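
For contrast with the linear scaling described above, the following Python sketch shows the naive direct evaluation that the fast multipole method is designed to replace: every pair of charges is visited once, so the work grows quadratically with N. This is the baseline only, not an implementation of the fast multipole method; the function name, unit constants, and test configuration are assumptions.

```python
import math


def direct_potentials(positions, charges):
    """Coulomb potential at each particle due to all others, by direct
    summation over all N*(N-1)/2 pairs (units chosen so the constant is 1)."""
    n = len(positions)
    potentials = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            inv_r = 1.0 / math.dist(positions[i], positions[j])
            # Each pair contributes to both particles' potentials.
            potentials[i] += charges[j] * inv_r
            potentials[j] += charges[i] * inv_r
    return potentials


# Three unit charges at the corners of a right triangle (illustrative data).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
charges = [1.0, 1.0, 1.0]
potentials = direct_potentials(positions, charges)
```

Doubling the number of particles roughly quadruples the number of pair evaluations in this loop, which is why direct summation becomes impractical for very large systems.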


January 31, 2014

Data Science and Design: Fickleness and How We Solve It

Speaker: Bo Peng, Data Scientist, Datascope Analytics

When solving problems, data scientists often encounter added layers of complexity when the problems to be solved are not well defined, and their solutions unclear. In these cases, standard, more straightforward approaches fall short, as they are not amenable to vague problems, and are thus not guaranteed to reliably produce useful results. At Datascope Analytics, we adopt methodologies from the design community and use a "continuous feedback loop" to iteratively improve dashboards, algorithms, and data sources to ensure that the resulting tool will be useful and well received. During this talk, I will illustrate our approach by sharing a detailed example from one of our projects. I will end by showing a live demo version of our final visualization tool, using movie data from the Internet Movie Database (IMDB).


November 22, 2013

Deep Learning for Distribution Estimation

Speaker: Hugo Larochelle, Assistant Professor of Computer Science, Université de Sherbrooke, Canada

Deep learning methods attempt to learn a deep and distributed representation of data directly from its low-level representation. The motivating argument is that high-dimensional data in AI-related domains (speech, computer vision, natural language) can take a more meaningful representation as a decomposition into several layers of abstractions, decomposing its different factors of variation. Deep learning methods thus try to discover and learn this representation directly from data.

In this talk, I will first discuss the basic concepts and methods behind deep learning, reviewing in particular the impressive advancements to the state-of-the-art it has recently permitted in speech recognition and visual object recognition.

I will then present my recent research on using neural networks for the task of distribution/density estimation, a fundamental problem in machine learning. Specifically, I will discuss the neural autoregressive distribution estimator (NADE), a state-of-the-art estimator of the probability distribution of data. I will also describe a deep version of NADE, which again illustrates the statistical modelling power of deep models.
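
For readers unfamiliar with autoregressive distribution estimation, the general idea can be written with the chain rule of probability. The notation below (a D-dimensional observation x with components x_1, ..., x_D in some fixed ordering) is an assumption for illustration; the specific parameterization of each conditional is the subject of the talk and is not reproduced here.

```latex
p(\mathbf{x}) = \prod_{d=1}^{D} p\left(x_d \mid x_1, \ldots, x_{d-1}\right)
```

An autoregressive estimator models each conditional on the right-hand side, so the product is a properly normalized distribution by construction.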


November 8, 2013

Probabilistic Programming and Probability Processing

Speaker: Ben Vigoda, Director, Analog Devices Lyric Labs

We are developing a computing stack for Bayesian inference and machine learning, including integrated circuits, probabilistic programming languages, compilers, and applications. Our first probability processor hardware demonstrates orders of magnitude wins on machine learning and statistical inference benchmarks. We are developing open-source probability programming languages that help enable rapid prototyping and development of statistical machine learning applications.  We will demonstrate some applications that we are building on top of the probability processing stack.


October 25, 2013

10 Simple Rules for the Care and Feeding of Scientific Data

Speaker: Mercè Crosas, Director of Data Science, Institute for Quantitative Social Science, Harvard

Increasingly, scientific publications and claims are based on ever-increasing volumes of data. Once the publication is complete, it is often difficult for others to locate the data and accompanying analyses, and once located, often challenging to make sense of them. For scientific results to continue being subject to verification and extension, we in the scientific community must ensure that good data management, with sufficient transparency and accessibility of data and analyses, become essential and ordinary elements of the research cycle. In this paper, we present 10 simple rules to help scientists towards this goal.


October 11, 2013

How the Brain Handles Big Data: Online Algorithms in Neurons

Speaker: Dmitri "Mitya" Chklovskii, Janelia Lab Head, Howard Hughes Medical Institute

Our brains constantly handle big data streamed by our sensory organs. Yet, how this is done in neurons, elementary building blocks of the brain, is not understood. We propose to view a neuron as a signal processing device representing its high-dimensional input by a synaptic weight vector scaled by its output. A neuron accomplishes this task by running two online algorithms: a slow algorithm which adjusts synaptic weights to extract the most non-Gaussian projection of the high-dimensional input, and a fast algorithm which estimates the projection amplitude. Both online algorithms rely on sparsity-inducing regularizers and have provable regret bounds. The steps of these algorithms account for the salient physiological features of neurons such as leaky integration, non-linear output function, Hebbian synaptic plasticity rules, and sparse connectivity and activity. Thus, our work should help model biological neural circuits and develop biologically inspired computing.


September 27, 2013

Prediction, Renaissance, and Cognition - 3 Questions for Computing

Speaker: Sadasivan Shankar, Senior Principal Engineer and Program Leader for Materials Design, Design and Technology Group, Intel Corp.; IACS Distinguished Scientist in Residence

With the increasing power of computing, humans appear to be on the verge of a golden era in the use of computing to address problems in all areas including energy, health, and information. Extrapolating the ever-increasing efficacy of hardware and software, it appears that we are moving towards being totally predictive and even exceeding the computing power of the brain. Based on our work on several aspects of modeling covering areas of chemistry and materials science, we will address the feasibility of such a vision and look back to history and renaissance to distill the lessons for the future of computing. In this journey, we hope to take you back to the future in which prediction has been one of the most sought-after goals for humans.


September 13, 2013 

Using Computation to Diagnose and Predict Heart Disease

Speaker: Efthimios Kaxiras, John Hasbrouck Van Vleck Professor of Pure and Applied Physics

The patterns of blood flow in arteries are crucial in determining the onset and progression of heart disease. These patterns can only be captured by simulations, assuming that the important details at different scales are properly described. This presentation will give an overview of our efforts to construct multiscale models of arterial blood flow based on the lattice Boltzmann equation.