Stephen Merity (S.M. ’14) has moved out West and he’s loving it. Less than a month after graduating from the CSE master's program, Merity has found himself at Common Crawl Foundation, a nonprofit that builds and maintains an open crawl of the web, in “the physical manifestation of the Internet,” San Francisco.
Originally hailing from Australia, the University of Sydney alum (Bach. of Information Tech. ’11) is a head engineer/data scientist and is helping the company achieve its goal of providing data, completely free of charge, to anyone who wants it.
“Each month we aim to release a new web snapshot that is over 200 terabytes in size, ranging between 2 and 4 billion pages. I'm working on simplifying the tools to process Common Crawl data, in addition to continuing to create the Common Crawl datasets,” Merity says.
Merity discovered Common Crawl when he began working with the dataset as part of his final project for CS 205: Computing Foundations for Computational Science. The opportunity gave him time to play with the data and “interact with a complex problem… that ran at scale,” and led him to reach out about professional roles.
Though Merity misses “the intense (late-night) discussions with classmates” – something that is difficult to find in the working world, where people “end up distracted with the individual task they’re working on” – he’s quickly adjusted to the temperate weather in sunny California. In the past three months, his exposure to Silicon Valley has ranged from engaging with local tech-centric universities, Stanford University and UC Berkeley, to playing Xbox 360 with the chief scientist at Kaggle.
At Common Crawl, the work is intellectually challenging and keeps Merity and his co-workers on their toes. “When I'm performing a crawl, there are two scary things happening. The first is that we're burning over a hundred dollars a day on AWS instances. The second is that our machines are downloading millions of pages per hour from servers all across the web. If I make a mistake, or a bug flares up, seriously bad things could happen very quickly.”
“One of the biggest benefits [though] is the brilliant network of volunteers and advisors we have [at Common Crawl],” Merity notes. “Despite the challenges, given that we crawl billions of pages a month and have next to zero complaints, we’ve done a good job so far.”