Introduction to Data Mining
Instructor: Abdullah Mueen
Time: 12:30 pm - 1:45 pm
Room: Centennial Engineering Center B146A
Office Hours: Tuesday and Thursday, 10:00AM-12:00PM
Office: FEC 340 (Knock if the door is closed)
TA: Nikan Chavoshi
Office Hours: Tue-Thu, 3:00PM-4:30PM
Office: FEC 345(B).
Announcements (most recent on top):
- Poster session will be on Monday, 8th December, 2014, 12:00PM-2:00PM in Stamm Room 1044 in CEC.
- Assignment Four (extra credit worth 3% of the class grade) has been posted. Due: Dec 9th by 11:59PM.
- Final Exam will be on Thursday, Dec 4th, 2014 in the class.
- Assignment 3 has been posted, Due Nov 30th, Sunday, by 11:59PM.
- Homework 4 posted. Due Nov 20th, Thursday, in the Class.
- Homework 3 has been posted. Due date is Oct 30th, Thursday, in the class.
- Assignment 2 has been posted. Due date extended from Oct 19th to Oct 24th, Friday, by 11:59PM.
- Project proposal details posted. Due date extended from Friday, Oct 10th, 2014 to Sunday, Oct 12th, 2014, by 11:59PM.
- Homework 2 posted. Due: Tuesday, 09/30/2014 in the class.
- Assignment 1 posted. Due: Friday, 09/19/2014 by 11:59PM.
- Project Proposals are due on Oct 10th, 2014 by 11:59PM.
- Midterm will be on Tuesday, Oct 7th, 2014 in the class.
- Homework 1 posted. Due: 9/11/2014, Thursday in the class.
- The email address for the teachers is cs591.teachers@gmail.com
- The Google group for the class is cs491-591-fall2014
- Groups are due by Tuesday 09/02/2014, 12:00PM
Syllabus
Description: This course covers data mining topics from basic to advanced level. Topics include data cleaning, clustering, classification, outlier detection, association-rule discovery, tools and technologies for data mining, and algorithms for mining complex data such as graphs, text and sequences. Students will work on a data mining project to gather hands-on experience.
The course learning objectives include:
- Learning basic data mining algorithms and their applications
- Learning about the tools and technologies available for analyzing various types of data
- Gaining hands-on experience in cleaning, managing and processing complex data.
Book: Data Mining: Concepts and Techniques, 3rd ed.
Lecture Schedule: Here
Grading: There will be two exams: a midterm on topics from weeks 1-7 and a final on the remainder of the topics. The exams are worth 25% each. Students will pick group projects and apply mining algorithms; the project is worth 20%. There will be three to five homework sets, together worth 10% of the class. There will be four assignments worth 5% each. Homework will focus on understanding the algorithms and techniques. The assignments will focus on applying different techniques to real data selected by the instructor.
Academic Integrity: For everyone's benefit, students should uphold the guidelines in the University of New Mexico Student Code of Conduct.
For the assignments in this class, discussion of concepts with others is encouraged, but all assignments must be done on your own,
unless otherwise instructed. If you use any source other than the text,
reference it/him/her, whether it be a person, a book, a solution set, a
web page or whatever. You MUST write up the solutions in your own words. Copying is strictly forbidden.
Americans with Disabilities Act (ADA) Policy Statement: The
Americans with Disabilities Act (ADA) is a federal antidiscrimination
statute that provides comprehensive civil rights protection for persons
with disabilities. Among other things, this legislation requires that
all students with disabilities be guaranteed a learning environment
that provides for reasonable accommodation of their disabilities. If
you believe you have a disability requiring an accommodation, please
contact the Department of Student Affairs, Accessibility Resource Center in Mesa Vista Hall, Rm. 2021.
Academic Calendar: For a list of dates to enroll in, change, or withdraw from classes and a list of holidays, go here.
Project: Each group will do one project. A group can have at most two
students. Students in the CS 491 section can have groups of three students. A project consists of two phases with equal weights.
- Data Preprocessing and Cleaning:
Each group will propose a data source or pick a dataset from a given list. Each
group will propose data mining tasks, a set of algorithms/tools and
success measures. Groups will clean the data for the projects and submit
the written proposals by Oct 12th, 2014.
Details: Here is a proposal from last year that was well structured. I need the following sections:
Title, Introduction, Data (collection and preprocessing), Hypothesis, Proposed Method, Validation and Conclusion. I need clear answers to the following questions:
What data will you be using? How is it formatted? What is the size of the data?
How will you clean the data? How will you process the data?
What hypothesis/hope do you have? How would you prove or disprove your hypothesis?
What methods will you use? What software tools will you use? How much programming does it need?
How do you validate that your method is working? How does that relate to proving your hypothesis?
- Implementation and Presentation:
Each group will implement the project and write up the methods and results
in the final project report. The groups will present and demonstrate
their projects in the class or in a poster session. A poster template is here.
Details:
Poster session will be on Monday, 8th December, 2014, 12:00PM-2:00PM. Students are advised to print their posters well ahead of time to avoid long queues at the printer. The poster session will be in the Centennial Engineering Center’s Stamm Room 1044. We will provide velcro stickers for hanging. I will visit your posters and grade them. Do NOT leave the room until I have seen your poster. If you have questions, email me.
Homework:
No late assignments will be accepted. There will be no make-up exams except for university-excused absences. Please discuss unusual circumstances in advance with the instructor.
Homework 1: Here
Due: Thursday, 09/11/2014, beginning of the lecture. No electronic
submission. Only paper-based submission. You have to show steps clearly
to convince us that you did it yourself. Solution
Homework 2: Here Due: Tuesday, 09/30/2014, in the class. No electronic
submission. Only paper-based submission. You have to show steps clearly
to convince us that you did it yourself.
Homework 3: Exercises 10.2, 10.7 and 10.8. Due Oct 30th, Thursday, in the class. Only paper-based submission.
Homework 4: Here Due: November 20th, Thursday in the class. Only paper submission.
Assignments:
No late assignments will be accepted. There will be no make-up exams except for university-excused absences. Please discuss unusual circumstances in advance with the instructor.
Assignment 1: Here.
Due: Friday, 09/19/2014 by 11:59PM. Only electronic submissions to the teachers' email address. We will not open submissions sent to our personal inboxes.
Assignment 2: Due: Friday Oct 19, 2014 by 11:59PM. Only electronic submissions to the teachers' email address.
Use the dataset from the previous assignment. Submit your code so I can reproduce the reported numbers for the classifiers.
a) Label the first 5000 rows as class 1 and the remaining rows as class
2. Use SVM and Neural Network to classify the data and report 10-fold
cross-validated accuracy. Describe the parameters of your classifiers.
b) Label the rows [1:500,1001:1500,2001:2500,3001:3500,4001:4500,5001:5500,6001:6500,7001:7500,8001:8500,9001:9500]
as class 1 and the remaining rows as class 2. Use SVM and Neural
Network to classify the data and report 10-fold cross-validated
accuracy. Describe the parameters of your classifiers.
SVM code snippet from the class.
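As a rough illustration of the labeling in parts (a) and (b) and of what "10-fold cross-validated accuracy" means, here is a minimal Python sketch. It is not the SVM snippet from the class: the function names, the shuffling seed, and the `fit_predict` interface are assumptions made for this example, and the actual classifier (SVM or neural network) is left to whichever tool you choose.

```python
import random
from collections import Counter


def labels_part_a(n_rows):
    # Rows are 1-indexed in the assignment text: rows 1..5000 are class 1.
    return [1 if i <= 5000 else 2 for i in range(1, n_rows + 1)]


def labels_part_b(n_rows):
    # Rows 1-500, 1001-1500, ..., 9001-9500 are class 1, i.e. the first
    # 500 rows of each block of 1000, up to row 9500.
    return [1 if i <= 9500 and (i - 1) % 1000 < 500 else 2
            for i in range(1, n_rows + 1)]


def k_fold_indices(n_rows, k=10, seed=0):
    # Shuffle the row indices once, then deal them into k disjoint folds.
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validated_accuracy(X, y, fit_predict, k=10, seed=0):
    # fit_predict(train_X, train_y, test_X) -> predicted labels (assumed interface).
    # Each fold is held out once; accuracy is pooled over all held-out rows.
    correct = 0
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(y)


def majority_class(train_X, train_y, test_X):
    # Stand-in baseline classifier: always predicts the most common training label.
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_X)
```

To use it, replace `majority_class` with a wrapper around your own SVM or neural network that has the same three-argument form. The baseline is only there to show what accuracy a trivial classifier gets on each labeling.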
Assignment 3: Due November 30th by 11:59PM. Only electronic submissions to the teachers' email address. Submit your code and plot. Describe any assumptions that you had to make.
a) Implement the Local Outlier Factor algorithm to find the LOFs of all the points in the dataset from Assignment 1.
b) Produce a plot showing the number of outliers for different values of k (i.e. 1 to 100). Use a threshold of 2 for deciding if a point is an outlier.
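The LOF computation in part (a) and the outlier count in part (b) could be sketched as follows. This is a minimal, unoptimized Python sketch of the standard LOF definition (k-distance, reachability distance, local reachability density). It assumes Euclidean distance, points given as sequences of numbers, more than k points, and no duplicate points (duplicates can make a density infinite). The function names are illustrative only.

```python
import math


def local_outlier_factors(points, k):
    n = len(points)
    # Pairwise Euclidean distances; O(n^2) time and memory, fine for a sketch.
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]           # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # The k-neighborhood keeps ties, so it may hold more than k points.
        neighbors.append([j for d, j in others if d <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        total = sum(max(k_dist[j], dist[i][j]) for j in neighbors[i])
        lrd.append(len(neighbors[i]) / total)

    # LOF: mean ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


def count_outliers(points, k, threshold=2.0):
    # A point counts as an outlier when its LOF exceeds the threshold.
    return sum(lof > threshold for lof in local_outlier_factors(points, k))
```

The counts for the plot in part (b) would then be `[count_outliers(points, k) for k in range(1, 101)]`. The quadratic distance table will be slow on a large dataset, so in practice you would compute the distances once and reuse them across values of k, or use a spatial index.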
Assignment 4: Due Dec 9th by 11:59PM. Online submissions only. For the given dataset, use a locality sensitive hashing scheme to search for approximate nearest neighbors. Use the following queryset. You can use any parameter choices to obtain the nearest neighbors.
Deliverables:
1. The approximate nearest neighbors of the queries.
2. A description of all the parameters and the reasons for choosing them.
3. The code for building the hash tables and searching them.
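One possible shape for such a scheme is sketched below in Python. The assignment does not prescribe a particular hash family, so this example assumes Euclidean distance and uses random-projection hashes of the form h(v) = floor((a . v + b) / w). The class name and the default parameter values (number of tables, hashes per table, bucket width) are arbitrary choices for illustration, not recommended settings.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    def __init__(self, dim, num_tables=8, hashes_per_table=4,
                 bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        # Every table gets its own list of (a, b) pairs: a is a Gaussian
        # random direction, b is a random offset in [0, w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)],
              rng.uniform(0.0, bucket_width))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_idx, v):
        # Concatenate the table's hash values into one bucket key.
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.funcs[table_idx]
        )

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)

    def query(self, q):
        # Gather every point sharing a bucket with q in at least one table,
        # then rank those candidates by exact distance.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), ()))
        if not candidates:
            return None  # no collision in any table
        return min(candidates, key=lambda i: math.dist(q, self.points[i]))
```

`query` returns the index of the closest candidate, or `None` when the query collides with nothing. More tables or wider buckets raise the chance of finding the true nearest neighbor, but each query then has more candidates to check. That trade-off is what the parameter description in deliverable 2 needs to justify.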
Data: Links to some data sources (in no particular order) you can use for the course projects. You are welcome to suggest any dataset of your choice, preferably large, noisy and (semi/un)structured.
- Social Network Graph of Twitter
- GPS Trajectories from Microsoft Research
- Tiny Images Dataset from MIT.
- Remote Sensing Data from NASA. Direct download link for the product MOD09CMG.005.
- 83 million Tweets from Twitter
- Daily Currency Conversion Rates between USD and others.
- Daily Values of Stock Tickers
- CMU Motion Capture Database
- MIR FLICKR
- Geo-tagged image data
- Video with GPS ground-truth
- ABQ Data
Tools:
Slides: