In this project, you will be identifying, loading, and analyzing data you have identified and has been approved by the instructor. This project will be completed in several stages, the requirements of which are specified below. There will be large amounts of class time set aside for project work. Use this to your advantage. Ask questions and work through issues in class.
|Section||Possible Points||Due date|
Extra credit may be awarded for exceptional/thorough work.
Part 1: Project Proposal
Submit a short project proposal (approx. ½ page) that identifies a data set containing two or more tables and at least 3 non-trivial questions about the underlying data that you will attempt to answer using SQL. (Alternatively, if you are dying to work on data that is only a single table, you may come up with 5 non-trivial questions instead).
A trivial questions would be “How many rows are in the table?” or “How many taxi cab rides were on Sunday?”. A non-trivial question is “Is better it to drive a cab on the weekends or during the week?”, “Is there a relationship between the location of the cab rides and the tipping percentage?”, or “When should I drive to maximize my profitability?”.
The short write-up must include the following points:
1. What data are you going to analyze?
2. What features does your data contains?
3. Where did you find this data (include a url)?
4. What is the format of the data?
5. What questions are you going to attempt to answer with your data?
6. What are you going to use to produce your report?
It should NOT just be a numbered list of answers. Write it in narrative form.
Note: If there is a different project involving MySQL you would like to work on that is of similar caliber to the project outlined below (e.g. creating an interactive web application that’s tied to the database), talk to me and we might be able to work something out.
Where to look for data:
· Kaggle: https://www.kaggle.com/datasets (Links to an external site.)Links to an external site. (filter to csv file types)
· Google dataset search: https://toolbox.google.com/datasetsearch (Links to an external site.)Links to an external site.
· Anywhere else you find. Feel free to ask if there is something you are interested in but having trouble finding.
Part 2: Loading and preparing data for analysis
Write SQL code to reproducibly load all files needed for the project and to clean up any data that needs it. This may include creating a database, tables, keys, indexes, additional summary tables, or views. The goal is to produce a SQL file that will load the data without issue if given the data files and loading script.
This must include:
1. The SQL code to create all involved tables
2. SQL code to load you data into the tables created in step 1
3. Code to create appropriate indexes on any keys fields
4. Documentation of any cleanup that needed to be done before or after loading the files
Part 3: Data Profile and Analysis
Write SQL code and create tables and charts needed to clearly answer the approved questions. May sure to address any shortcomings in your analysis. If you would like, you can do the analysis in R or Python as long as the data is sourced from MySQL. It is also acceptable to write a report in Word. All SQL code should run and reproduce the results in the final write-up. Analysis should include the following:
1. Demonstration that the data was loaded correctly. This can be done by producing summary metrics on the various important columns (e.g. if this table is supposed to have every state in the USA, verify that it does).
2. SQL to produce any derived objects (additional tables or views) needed in your analysis.
3. SQL to produce the data behind any charts or tables presented in your analysis.
Part 4: Write-up
Write up your results as a report with an introduction, a detailed discussion of your analysis, and a conclusion. Use the Simpsons tutorial as a template for format. The discussion should include the following:
1. Describe your data: What is the source of your data? What fields are present? Do you have any reason to trust/distrust the data source? Is the data complete enough to answer your questions?
2. Answer your questions: What steps did you take to answer your questions? What assumptions did you have to make? How does your data support your answers? This should be organized in a logical way.
3. Draw overall conclusions: Did the answers to your questions lead to any insights? Was anything surprising? What are the limitations of your conclusions? What follow-up questions do you have? Is there anything else that stood out in your analysis?