Key Papers [Student Email(s)]
Architecture: Storage The Hadoop Distributed File System. Shvachko et al., MSST, 2010. []
Storage: FDS Flat Datacenter Storage. Nightingale et al., OSDI, 2012. [sjamal7, psingh56]
Storage: Bigtable Bigtable: A Distributed Storage System for Structured Data. Chang et al., OSDI, 2006. [spani2, bjain6]
Storage: Dynamo Dynamo: Amazon’s Highly Available Key-value Store. DeCandia et al., SOSP, 2007. [sjamal7, lchen79]
Storage: Spanner Spanner: Google’s Globally-Distributed Database. Corbett et al., OSDI, 2012. [psingh56, lchen79]
Storage: MemcachedFacebook Scaling Memcache at Facebook. Nishtala et al., NSDI, 2013. [sjamal7, psingh56]
Execution: MR MapReduce: Simplified Data Processing on Large Clusters. Dean and Ghemawat, OSDI, 2004. [spani2, bjain6]
Execution: Dryad Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Isard et al., EuroSys, 2007. [psingh56, dporte7, srawat5]
Execution: CIEL CIEL: A Universal Execution Engine for Distributed Data-Flow Computing. Murray et al., NSDI, 2011. [psingh56]
Execution: DryadLINQ DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Yu et al., OSDI, 2008. [kshind2, wtoher2]
ResourceNeg: YARN Apache Hadoop YARN: Yet Another Resource Negotiator. Vavilapalli et al., SoCC, 2013. [cmonta9, avenka35]
ResourceNeg: Borg Large-scale Cluster Management at Google with Borg. Verma et al., EuroSys, 2015. [lchen79]
ResourceNeg: Mesos Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Hindman et al., NSDI, 2011. [wtoher2]
ResourceNeg: DRF Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. Ghodsi et al., NSDI, 2011. [psingh56]
Scheduling: Packing Altruistic Scheduling in Multi-Resource Clusters (Carbyne). Grandl et al., OSDI, 2016. [psingh56]
Scheduling: Packing Quincy: Fair Scheduling for Distributed Computing Clusters. Isard et al., SOSP, 2009. [psingh56]
Execution: Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Zaharia et al., NSDI, 2012. [lchen79]
Execution: LoadBalancing1 Ananta: Cloud Scale Load Balancing. Patel et al., SIGCOMM, 2013. [sjamal7]
Execution: LoadBalancing2 Duet: Cloud Scale Load Balancing with Hardware and Software. Gandhi et al., SIGCOMM, 2014. [sjamal7]
Execution: LoadBalancing3 SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. Miao et al., SIGCOMM, 2017. [sgadho2, dporte7, srawat5]
SQL: SparkSQL Spark SQL: Relational Data Processing in Spark. Armbrust et al., SIGMOD, 2015. [spani2, bjain6, sgadho2]
SQL: Hive Major Technical Advancements in Apache Hive. Huai et al., SIGMOD, 2014. [spani2, bjain6, sgadho2]
SQL: Impala Impala: A Modern, Open-Source SQL Engine for Hadoop. Kornacker et al., CIDR, 2015. [avenka35]
SQL: Trill Trill: A High-Performance Incremental Query Processor for Diverse Analytics. Chandramouli et al., VLDB, 2014. [sgadho2]
GeoDistributed: Clarinet Clarinet: WAN-Aware Optimization for Analytics Queries. Viswanathan et al., OSDI, 2016. [sgadho2]
Streaming: Storm Storm @Twitter. Toshniwal et al., SIGMOD, 2014. [wtoher2]
Streaming: Heron Twitter Heron: Stream Processing at Scale. Kulkarni et al., SIGMOD, 2015. [wtoher2]
Streaming: FacebookStreaming Realtime Data Processing at Facebook. Chen et al., SIGMOD, 2016. [cmonta9]
Streaming: SparkStreaming Discretized Streams: Fault-Tolerant Streaming Computation at Scale. Zaharia et al., SOSP, 2013. [sgadho2]
Streaming: Flink Apache Flink: Stream and Batch Processing in a Single Engine. Carbone et al., Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015. [wtoher2]
Streaming: Drizzle Drizzle: Fast and Adaptable Stream Processing at Scale. Venkataraman et al., SOSP, 2017. [sgadho2]
Streaming: Gloss Gloss: Seamless Live Reconfiguration and Reoptimization of Stream Programs. Rajadurai et al., ASPLOS, 2018. [wtoher2]
QMS: Kafka Kafka: A Distributed Messaging System for Log Processing. Kreps et al., NetDB Workshop, 2011. Also read this comparison of widely used queuing/messaging systems. [lchen79, cmonta9]
Streaming: rStreams StreamScope: Continuous Reliable Distributed Processing of Big Data Streams. Lin et al., NSDI, 2016. [sgadho2]
Streaming: Dataflow The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Akidau et al., VLDB, 2015. [spani2, bjain6, avenka35]
Streaming: Scaling Three Steps Is All You Need: Fast, Accurate, Automatic Scaling Decisions for Distributed Streaming Dataflows. Kalavri et al., OSDI, 2018. [dporte7, srawat5, kshind2]
GraphProc: Pregel Pregel: A System for Large-Scale Graph Processing. Malewicz et al., SIGMOD, 2010. [dporte7]
GraphProc: TAO TAO: Facebook’s Distributed Data Store for the Social Graph. Bronson et al., USENIX ATC, 2013. [srawat5, kshind2]
GraphProc: PowerGraph PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. Gonzalez et al., OSDI, 2012. [dporte7, srawat5, kshind2]
GraphProc: GraphX GraphX: Graph Processing in a Distributed Dataflow Framework. Gonzalez et al., OSDI, 2014. [avenka35]
GraphProc: RDF Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration. Shi et al., OSDI, 2016. [avenka35]
GraphProc: Facebook One Trillion Edges: Graph Processing at Facebook-Scale. Ching et al., VLDB, 2015. [spani2, bjain6, cmonta9]
Social: FacebookAnalytics Data Warehousing and Analytics Infrastructure at Facebook. Thusoo et al., SIGMOD, 2010. [spani2, sjamal7, bjain6]
Social: FacebookPhoto Finding a Needle in Haystack: Facebook’s Photo Storage. Beaver et al., OSDI, 2010. [spani2, sjamal7, cmonta9]
Social: Unicorn Unicorn: A System for Searching the Social Graph. Curtiss et al., VLDB, 2013. [bjain6]
Monitor: Scuba Scuba: Diving into Data at Facebook. Abraham et al., VLDB, 2013. [lchen79]
Video: SVE SVE: Distributed Video Processing at Facebook Scale. Huang et al., SOSP, 2017. [cmonta9]
Runtime: Weld Weld: A Common Runtime for High Performance Data Analytics. Palkar et al., CIDR, 2017. []
Serverless: OpenLambda Serverless Computation with OpenLambda. Hendrickson et al., HotCloud, 2016. [sjamal7, lchen79, avenka35]
Approx: BlinkML BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees. Park et al., SIGMOD, 2019. [kshind2, wtoher2]
RDMA: FaRM FaRM: Fast Remote Memory. Dragojevic et al., NSDI, 2014. [kshind2, wtoher2]
RDMA: FastNetworks Remote Memory in the Age of Fast Networks. Aguilera et al., SoCC, 2017. []
ML: Facebook Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. Hazelwood et al., HPCA, 2018. [cmonta9, avenka35]
ML: TensorFlow TensorFlow: A System for Large-Scale Machine Learning. Abadi et al., OSDI, 2016. [lchen79, avenka35]
ML: TPU In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi et al., ISCA, 2017. [cmonta9, dporte7, srawat5]
Offload: iPipe iPipe: A Framework for Building Datacenter Applications Using In-networking Processors. Liu et al., 2018. [dporte7, srawat5, kshind2]
Offload: Access Direct Universal Access: Making Data Center Resources Available to FPGA. Shu et al., NSDI, 2019. [dporte7, srawat5, kshind2]

Overview

Every student must complete 8 paper reviews in total. Before the paper is presented in class (i.e., before 12:29pm on the day of class), your group must post its review to the Course Discussion Website (Piazza). Your posting should be based on the following template.

Specifically, it must contain:

  1. A one- or two-sentence summary of the paper.
  2. A description of the problem the authors were trying to solve, and why it is an important problem.
  3. A summary of the contributions of the paper. What is the hypothesis of the work? What is the proposed solution, and what key insight guides their solution?
  4. An assessment of what the paper evaluated and whether the evaluation was appropriate.
  5. One (or more) drawback or limitation of the proposal, and how you would improve it.
  6. At least one thing about the paper you would like to discuss in class.

The write-up should not be (significantly) more than a page in length. Late write-ups will receive a zero grade.

Advice

The reading schedule for this course will be intense. Even though not every student is required to submit a review for every paper, every student is expected to read every paper.

When reading and discussing each paper (on Piazza), you are encouraged to consider the following questions:

  • What problem are the authors trying to solve?
    • Why was the problem important?
    • Why was the problem not solved by earlier work?
  • What is the authors’ solution to the problem?
    • How does their approach solve the problem?
    • How is the solution unique and innovative?
    • What are the details of their solution?
  • How do the authors evaluate their solution?
    • What specific questions do they answer?
    • What simplifying assumptions do they make?
    • What is their methodology?
    • What are the strengths and weaknesses of their solution?
    • What is left unknown?
  • What do you think?
    • Is the problem still important?
    • Did the authors solve the stated problem?
    • Did the authors adequately demonstrate that they solved the problem?
    • What future work does this research point to?

You should be prepared to discuss these questions in class. For each paper I will ask for a volunteer to summarize and address a few of these questions in class.

Here are a few links to advice on reading papers:

Writeup Grading

What I’m looking for:

  • Does the review include all sections (summary, problem, contributions, evaluation, drawbacks, discussion topic)?
  • Are all assertions backed up? (E.g., “X is a bad idea” is not acceptable, but “X is a bad idea because Y” is acceptable.)
  • Is the review concise? The summary should be a few sentences and give the essence of the design in the paper, not the problem. (E.g., “This paper is about how to build a multiprocessor operating system” is not acceptable, but “This paper is about building a multiprocessor operating system by layering abstractions that mask the existence of multiple processors” is acceptable)
  • Did the student understand the material? Are there factual flaws in the review? For example, if the paper defines a term, does the student use it appropriately? As another example, if students state that a paper is relevant because modern operating systems do things the same way, is that true?
  • Did the student consider whether the evaluation is sufficient? Does it show that the work doesn’t harm regular programs, even if it works well for some programs? Do they evaluate all the goals for the system?

Assigning grades:

  • If the review does an excellent job on all five considerations, and provides genuinely insightful comments about the problem, contributions (going beyond what the paper claims are contributions), evaluation, and confusions, it should receive a check plus.
  • If two or more of the five criteria are not met, the review should receive a check minus.
  • Otherwise, it should receive a check. A check plus is worth 1 point, a check is worth 3/4 point, a check minus is worth 1/2 point, and not turning in a review is worth zero points.
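
For example (a hypothetical scenario, not part of the policy itself): a student whose 8 reviews earn 3 check pluses, 4 checks, and 1 check minus would score 3 × 1 + 4 × 3/4 + 1 × 1/2 = 6.5 out of a possible 8 points.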