**ITS-632 Intro to Data Mining**

Dr. Steven Case

Dept. of Information Technology &

School of Computer and Information Sciences

University of the Cumberlands

**Chapter 2 Assignment**

- What’s an attribute? What’s a data instance?

- What’s noise? How can noise be reduced in a dataset?

- Define outlier. Describe 2 different approaches to detect outliers in a dataset.

- Describe 3 different techniques to deal with missing values in a dataset. Explain when each of these techniques would be most appropriate.

- Given a sample dataset with missing values, apply an appropriate technique to deal with them.

- Give 2 examples in which aggregation is useful.

- Given a sample dataset, apply aggregation of data values.

- What’s sampling?

- What’s simple random sampling? Is it possible to sample data instances using a distribution different from the uniform distribution? If so, give an example of a probability distribution of the data instances that is different from uniform (i.e., equal probability).

- What’s stratified sampling?

- What’s “the curse of dimensionality”?

- Provide a brief description of what Principal Components Analysis (PCA) does. [Hint: See Appendix A and your lecture notes.] State what the input and the output of PCA are.

- What’s the difference between dimensionality reduction and feature selection?

- Describe in detail 2 different techniques for feature selection.

- Given a sample dataset (represented by a set of attributes, a correlation matrix, a covariance matrix, …), apply feature selection techniques to select the best attributes to keep (or equivalently, the best attributes to remove).

- What’s the difference between feature selection and feature extraction?

- Give 2 examples of data in which feature extraction would be useful.

- Given a sample dataset, apply feature extraction.

- What’s data discretization and when is it needed?

- What’s the difference between supervised and unsupervised discretization?

- Given a sample dataset, apply unsupervised discretization (e.g., equal-width or equal-frequency) or supervised discretization (e.g., entropy-based).

- Describe 2 approaches to handle nominal attributes with too many values.

- Given a dataset, apply a variable transformation: either a simple given function, normalization, or standardization.

- Define correlation and covariance, and explain how to use them in data pre-processing (see pp. 76-78).
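
As a point of reference for the discretization and variable-transformation questions above, the following is a minimal sketch of equal-width binning, equal-frequency binning, min-max normalization, and z-score standardization. The function names and the `ages` sample are hypothetical, chosen only for illustration; they do not come from the textbook or the lecture notes. The sketch assumes a non-empty list of numeric values that are not all identical.

```python
from statistics import mean, pstdev


def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly the same number of values.

    Assumption: ties are split by position, so equal values may land in
    different bins. A real implementation may prefer to keep ties together.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical sample attribute, used only to exercise the functions.
ages = [18, 22, 25, 30, 41, 58]

print(equal_width_bins(ages, 3))      # [0, 0, 0, 0, 1, 2]
print(equal_frequency_bins(ages, 3))  # [0, 0, 1, 1, 2, 2]
print(min_max_normalize(ages))
print(standardize(ages))
```

On this skewed sample, equal-width binning places four of the six values in the first interval, while equal-frequency binning places two values in each bin. That contrast is one way to reason about which unsupervised method suits a given attribute.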