CMPSCI 645: Database Design and Implementation

Course mini-project [Due: May 13]

The mini-project is a collaborative assignment due at the end of the semester (May 13). Students will work in teams of 3-4 to reproduce partial results from a research paper, selected from the list below.

  • Step 1: Self-organize into teams of 3-4. We suggest using Campuswire to connect with other students if you need help finding teammates. It is your responsibility to form a team in a timely fashion.
  • Step 2: Please select one of the mini-project options below to work on with your team. The mini-project has a single deliverable at the end of the semester. It is your responsibility to pace your work appropriately.
  • Step 3: Complete the reproducibility tasks for your chosen project. To reproduce the research successfully, you need to carefully read and understand the corresponding paper. Your implementation should be your own! You are not allowed to use code related to this research that you found online or obtained from others.
  • Step 4: Report on your findings. You will need to submit a short report detailing your process and showcasing your reproducibility results. Some of the results may not be perfectly reproducible for a variety of reasons (e.g., changes in the datasets, differences in parameter tuning, etc.); that's OK! Feel free to (optionally) include in your report additional results and steps you took, beyond the required reproducibility tasks.
  • Step 5: Upload your report to Gradescope. The deadline is May 13, 11:59pm. No late submissions will be accepted.

Project 1: A formal approach to finding explanations for database queries

Reproducibility tasks: Read the paper and understand the methodology. You will need to implement Algorithm 1, and reproduce the results of Figures 2 and 15. Your results may differ somewhat from the ones in the paper, as the datasets have changed since publication.

Data: The results this mini-project aims to reproduce use the DBLP dataset, which you have worked with in prior assignments, and the GeoDBLP dataset. Please refer to the paper for details on the datasets and how they were used in this research.

Project 2: Explore-by-example: an automatic query steering framework for interactive data exploration

Reproducibility tasks: Please read the paper and understand the overall approach and evaluation methodology. For each target query chosen for evaluation, please start with 3 positive examples and 3 negative examples to build the initial classification tree. Then you are asked to implement the techniques in Section 4 (Misclassified Exploitation) and Section 5 (Boundary Exploitation). Please reproduce the results of Figure 8(d).
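As a starting point, the seeding step described above might be sketched as follows. This is only an illustration, not the paper's procedure: the function name `seed_examples`, the attribute names `rowc` and `colc`, and the range-shaped target query are all assumptions made for the example. The labeled sample it returns would then be passed to whichever decision tree learner you choose for the initial classification tree.

```python
import random

def seed_examples(tuples, target_predicate, n_pos=3, n_neg=3, seed=0):
    """Draw n_pos relevant and n_neg irrelevant tuples for a target query.

    `target_predicate` plays the role of the (hidden) target query: it
    returns True for tuples the simulated user would label as relevant.
    All names here are hypothetical, chosen for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    positives = [t for t in tuples if target_predicate(t)]
    negatives = [t for t in tuples if not target_predicate(t)]
    if len(positives) < n_pos or len(negatives) < n_neg:
        raise ValueError("not enough positive or negative tuples to sample")
    sample = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    labels = [1] * n_pos + [0] * n_neg  # 1 = relevant, 0 = irrelevant
    return sample, labels

# Toy data: two numeric attributes (names assumed, not taken from the handout).
data_rng = random.Random(42)
tuples = [{"rowc": data_rng.uniform(0, 100), "colc": data_rng.uniform(0, 100)}
          for _ in range(1000)]

# Hypothetical target query: a rectangular range over both attributes.
def in_target(t):
    return 20 <= t["rowc"] <= 60 and 30 <= t["colc"] <= 80

sample, labels = seed_examples(tuples, in_target)
```

Fixing the random seed makes it easier to compare runs across team members; averaging over several seeds is one way to smooth out the variance that comes from such a small initial sample.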

Data: The results that this mini-project aims to reproduce will be based on a 100k tuple set from the Sloan Digital Sky Survey (SDSS) database. The dataset is available here.

Project 3: SeeDB: efficient data-driven visualization recommendations to support visual analytics

Reproducibility tasks: Please read the paper and understand the overall approach and evaluation methodology. In particular, implement the algorithm based on the definition in Section 2, Sharing-based Optimization (through query rewriting) in Section 4.1, and Pruning-based Optimization (using the Hoeffding-Serfling inequality) in Section 4.2. For the evaluation, use the census data set. Set the user-specified query to select married people, and the reference query to select unmarried people. Use the K-L divergence as the utility measure. Find the top-5 aggregate views by the utility measure. The plots should look like those illustrated in Figure 1.
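To make the utility measure concrete, here is a minimal sketch of scoring aggregate views by K-L divergence and keeping the best k. It is a naive baseline without the sharing or pruning optimizations, and several choices are assumptions made for illustration: SUM as the only aggregate function, the small smoothing constant `eps` that avoids division by zero for empty groups, the helper names, and the toy attribute names `sex`, `capital_gain`, and `age`.

```python
import math
from collections import defaultdict

def aggregate(rows, dimension, measure):
    """Compute SUM(measure) GROUP BY dimension over a list of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

def normalize(totals, groups, eps=1e-9):
    """Turn per-group totals into a probability distribution over `groups`.

    eps is an assumed smoothing constant so that no group has probability 0.
    """
    values = [totals.get(g, 0.0) + eps for g in groups]
    s = sum(values)
    return [v / s for v in values]

def kl_divergence(p, q):
    """K-L divergence D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def view_utility(target_rows, reference_rows, dimension, measure):
    """Utility of one view: divergence of target from reference distribution."""
    t = aggregate(target_rows, dimension, measure)
    r = aggregate(reference_rows, dimension, measure)
    groups = sorted(set(t) | set(r))  # align both distributions on the same groups
    return kl_divergence(normalize(t, groups), normalize(r, groups))

def top_k_views(target_rows, reference_rows, dimensions, measures, k=5):
    """Score every (dimension, measure) pair and return the k highest."""
    scored = [(view_utility(target_rows, reference_rows, d, m), d, m)
              for d in dimensions for m in measures]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

# Toy rows standing in for the married (target) and unmarried (reference) subsets.
married = [{"sex": "F", "capital_gain": 10.0, "age": 30.0},
           {"sex": "M", "capital_gain": 30.0, "age": 30.0}]
unmarried = [{"sex": "F", "capital_gain": 20.0, "age": 25.0},
             {"sex": "M", "capital_gain": 20.0, "age": 25.0}]

best = top_k_views(married, unmarried, ["sex"], ["capital_gain", "age"], k=1)
```

In the toy data, the normalized age distribution is the same for both subsets, so that view scores near zero, while capital gain is split differently across the groups and ranks first. A baseline like this can also serve as a correctness check for the optimized versions, which should return the same (or nearly the same) top-k views.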

Data: Census data set