Introduction to Data Mining

Instructor: Abdullah Mueen
Time: 12:30 pm - 1:45 pm
Room: Centennial Engineering Center B146A
Office Hours: Tuesday and Thursday, 10:00AM-12:00PM
Office: FEC 340 (Knock if the door is closed)

TA: Nikan Chavoshi
Office Hours: Tue-Thu, 3:00PM-4:30PM
Office: FEC 345(B).

Announcements (most recent on top):


Syllabus

Description: This course covers data mining topics from basic to advanced level. Topics include data cleaning, clustering, classification, outlier detection, association-rule discovery, tools and technologies for data mining and algorithms for mining complex data such as graphs, text and sequences. Students will work on a data mining project to gather hands-on experience.

The course learning objectives include


Book: Data Mining: Concepts and Techniques, 3rd ed.
Lecture Schedule: Here

Grading: There will be two exams: a midterm on topics from weeks 1-7 and a final on the remainder of the topics. The exams are worth 25% each. Students will pick group projects and apply mining algorithms; the project is worth 20%. There will be three to five homework sets, together worth 10% of the grade, and four assignments worth 5% each. Homework will focus on understanding the algorithms and techniques. The assignments will focus on applying different techniques to real data selected by the instructor.

Academic Integrity:
 For everyone's benefit, students should uphold the guidelines in the University of New Mexico Student Code of Conduct.

For the assignments in this class, discussion of concepts with others is encouraged, but all assignments must be done on your own, unless otherwise instructed. If you use any source other than the text, cite it, whether it is a person, a book, a solution set, a web page or anything else. You MUST write up the solutions in your own words. Copying is strictly forbidden.

Americans with Disabilities Act (ADA) Policy Statement: The Americans with Disabilities Act (ADA) is a federal antidiscrimination statute that provides comprehensive civil rights protection for persons with disabilities. Among other things, this legislation requires that all students with disabilities be guaranteed a learning environment that provides for reasonable accommodation of their disabilities. If you believe you have a disability requiring an accommodation, please contact the Department of Student Affairs, Accessibility Resource Center in Mesa Vista Hall, Rm. 2021. 

Academic Calendar: For a list of dates to enroll in, change or withdraw from classes and a list of holidays, go here.

Project: Each group will do one project. A group can have at most two students. Students in the CS 491 section can have groups of three students. A project consists of two phases with equal weights.

  1. Data Preprocessing and Cleaning: Each group will propose a data source or pick a dataset from a given list. Each group will propose data mining tasks, a set of algorithms/tools and success measures. Groups will clean the data for their projects and submit written proposals by Oct 12th, 2014.

    Details: Here is a proposal from last year that was well structured. I need the following sections: Title, Introduction, Data (collection and preprocessing), Hypothesis, Proposed Method, Validation and Conclusion. I need clear answers to the following questions:

    What data will you be using? How is it formatted? What is the size of the data? How will you clean the data? How will you process the data?
    What hypothesis/hope do you have? How would you prove or disprove your hypothesis?
    What methods will you use? What software tools will you use? How much programming does it need?
    How will you validate that your method is working? How does that relate to proving your hypothesis?

  2. Implementation and Presentation: Each group will implement the project and write up the methods and results in the final project report. The groups will present and demonstrate their projects in the class or in a poster session. A poster template is here.

    Details: The poster session will be on Monday, 8th December, 2014, 12:00PM-2:00PM. Students are advised to print their posters well ahead of time to avoid long queues at the printer. The poster session will be in the Centennial Engineering Center’s Stamm Room 1044. We will provide velcro stickers for hanging. I will visit your posters and grade them. Do NOT leave the room until I have seen your poster. If you have questions, email me.


Homework:

No late assignments will be accepted. There will be no make-up exams except for university-excused absences. Please discuss unusual circumstances in advance with the instructor.

Homework 1: Here Due: Thursday, 09/11/2014, at the beginning of the lecture. No electronic submission. Only paper-based submission. You have to show steps clearly to convince us that you did it yourself. Solution

Homework 2: Here Due: Tuesday, 09/30/2014, in the class. No electronic submission. Only paper-based submission. You have to show steps clearly to convince us that you did it yourself.

Homework 3: Exercises 10.2, 10.7 and 10.8. Due Oct 30th, Thursday, in the class. Only paper-based submission.

Homework 4: Here Due: November 20th, Thursday in the class. Only paper submission.

Assignments:

No late assignments will be accepted. There will be no make-up exams except for university-excused absences. Please discuss unusual circumstances in advance with the instructor.
 

Assignment 1: Here. Due: Friday, 09/19/2014 by 11:59PM. Only electronic submissions to the teacher's email address. Submissions sent to our personal inboxes will not be opened.

Assignment 2: Due: Friday Oct 19, 2014 by 11:59PM. Only electronic submissions to the teacher's email address. Use the dataset from the previous assignment. Submit your code so I can reproduce the reported numbers for the classifiers.
a) Label the first 5000 rows as class 1 and the remaining rows as class 2. Use SVM and Neural Network to classify the data and report 10-fold cross-validated accuracy. Describe the parameters of your classifiers.
b) Label the rows [1:500,1001:1500,2001:2500,3001:3500,4001:4500,5001:5500,6001:6500,7001:7500,8001:8500,9001:9500] as class 1 and the remaining rows as class 2. Use SVM and Neural Network to classify the data and report 10-fold cross-validated accuracy. Describe the parameters of your classifiers.
SVM code snippet from the class.
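
A minimal Python sketch of the labeling in parts (a) and (b) and of a 10-fold cross-validation loop, offered as a starting point rather than a reference solution. It assumes the row ranges are 1-based and inclusive, so 1:500 means the first 500 rows. All function names are hypothetical. The nearest-centroid classifier is only a stand-in to make the sketch runnable; it is not an SVM or a neural network, and you would pass your own fit and predict functions instead.

```python
import random

def labels_part_a(n_rows):
    """Part (a): the first 5000 rows are class 1, the rest class 2."""
    return [1 if i < 5000 else 2 for i in range(n_rows)]

def labels_part_b(n_rows):
    """Part (b): rows 1-500, 1001-1500, ..., 9001-9500 are class 1.

    With 0-based index i, that is the first half of each block of
    1000 rows, restricted to the first 10000 rows.
    """
    return [1 if i < 10000 and i % 1000 < 500 else 2 for i in range(n_rows)]

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle the row indices and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validated_accuracy(X, y, fit, predict, k=10, seed=0):
    """Fraction of rows predicted correctly while held out of training."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Stand-in classifier (NOT an SVM or neural network): nearest class centroid.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_centroid(model, row):
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], row)))
```

A call would look like cross_validated_accuracy(X, labels_part_b(len(X)), fit_centroids, predict_centroid), where X is the list of data rows. The fixed seed keeps the fold assignment the same across runs, which helps when the reported numbers have to be reproduced.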

Assignment 3: Due November 30th by 11:59PM. Only electronic submissions to the teacher's email address. Submit your code and plot. Describe any assumptions you had to make.

a) Implement the Local Outlier Factor algorithm to find the LOFs of all the points in the dataset from Assignment 1.
b) Produce a plot that shows the number of outliers for different values of k (i.e., 1 to 100). Use a threshold of 2 to decide whether a point is an outlier.
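
A brute-force Python sketch of the Local Outlier Factor computation in parts (a) and (b), under stated assumptions rather than as the expected solution. It assumes Euclidean distance, no duplicate points (so reachability distances are never zero), and that "a threshold of 2" means a point is an outlier when its LOF is greater than 2. As a simplification, each point gets exactly its k nearest neighbors with ties broken by index, whereas the original definition keeps all points tied at the k-distance. The function names are hypothetical.

```python
import math

def lof_scores(points, k):
    """Return the LOF of every point, using exactly k nearest neighbors."""
    n = len(points)
    # neighbors[i] holds (distance, index) for the k points nearest to point i.
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [nb[-1][0] for nb in neighbors]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbors' densities to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

def count_outliers(points, k, threshold=2.0):
    """Number of points whose LOF exceeds the threshold."""
    return sum(score > threshold for score in lof_scores(points, k))

def outlier_counts(points, ks):
    """Outlier count for each k in ks, ready to be plotted against k."""
    return [count_outliers(points, k) for k in ks]
```

For the plot, outlier_counts(points, range(1, 101)) gives the y-values for k = 1 to 100. This sketch recomputes all pairwise distances for every k, which is slow on large data; computing the sorted neighbor lists once and slicing them per k is one way to speed it up.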

Assignment 4: Due Dec 9th by 11:59PM. Online submissions only. For the given dataset, use a locality-sensitive hashing scheme to search for approximate nearest neighbors. Use the following query set. You can use any parameter choices to obtain the nearest neighbors.
Deliverables:
  1. The approximate nearest neighbors of the queries.
  2. A description of all the parameters and the reasons for choosing them.
  3. The code for building the hash tables and searching them.
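
One possible shape for such a scheme, sketched in Python under assumptions the assignment does not fix. It assumes Euclidean distance and uses random Gaussian projections quantized into buckets (the p-stable family), which is only one of several valid LSH families. The class name, parameter names and default values are hypothetical placeholders, not recommended settings.

```python
import math
import random
from collections import defaultdict

class EuclideanLSH:
    """Several hash tables, each keyed by a tuple of quantized projections."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # Each hash function is a random direction a and an offset b in [0, w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, bucket_width))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_funcs, v):
        # h(v) = floor((a . v + b) / w) for each function in the table.
        return tuple(
            math.floor((sum(a_i * v_i for a_i, v_i in zip(a, v)) + b) / self.w)
            for a, b in table_funcs
        )

    def add(self, v):
        """Store v in one bucket of every table."""
        idx = len(self.points)
        self.points.append(v)
        for funcs, table in zip(self.funcs, self.tables):
            table[self._key(funcs, v)].append(idx)

    def query(self, q, num_neighbors=1):
        """Indices of the closest stored points among the bucket collisions."""
        candidates = set()
        for funcs, table in zip(self.funcs, self.tables):
            candidates.update(table.get(self._key(funcs, q), []))
        ranked = sorted(candidates, key=lambda i: math.dist(self.points[i], q))
        return ranked[:num_neighbors]
```

More hash functions per table make buckets smaller and candidates fewer, while more tables raise the chance that a true neighbor collides with the query in at least one of them. A query can return fewer results than requested, or none, when few points share its buckets, which is the trade-off to discuss when justifying parameter choices.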

Data: Links to some data sources (in no particular order) you can use for the course projects. You are welcome to suggest any dataset of your choice, preferably large, noisy and (semi/un)structured.
  1. Social Network Graph of Twitter
  2. GPS Trajectories from Microsoft Research
  3. Tiny Images Dataset from MIT.
  4. Remote Sensing Data from NASA. Direct download link for the product MOD09CMG.005.
  5. 83 million tweets from Twitter
  6. Daily Currency Conversion Rates between USD and others.
  7. Daily Values of Stock Tickers
  8. CMU Motion Capture Database
  9. MIR FLICKR
  10. Geo-tagged image data
  11. Video with GPS ground-truth
  12. ABQ Data

Tools:
Slides: