CS 521: Data Mining Techniques
Instructor: Abdullah Mueen
Time: Monday 2:00 pm - 4:30 pm
Room: Centennial Engineering Center B146A
Office Hours: Wednesday 1:30-3:00PM and Thursday, 3:30PM-4:00PM
Office: Travelstead Hall, B01A (Knock if the door is closed)
Use the following email address to submit assignments, ask questions, and make comments. Do not send to our personal email addresses. The email address is correct.
Email Address: cs521.unm@gmail.com
Announcements (most recent on top):
- Ensembling Project is due on Dec 15th, 2016.
- HW3 posted, Due 11/14/16. Submit in the class.
- Outlier Detection Project is due on Nov 20th, 2016.
- Final Exam will be on December 5th, 2016 in the classroom at 2:00PM.
- Clustering Project is due on Oct 30th, 2016.
- HW2 posted, Due 10/26/16. Submit in the class.
- Classification Project is due on Oct 3rd, 2016.
- HW1 posted, Due 9/26/16. Submit in the class.
- Project selections are due by Tuesday 09/01/2016, 11:59PM. Send an email.
Syllabus
Description: This course covers data mining topics from basic to advanced level. Topics include data cleaning, clustering, classification, outlier detection, association-rule discovery, tools and technologies for data mining and algorithms for mining complex data such as graphs, text and sequences. Students will work on a data mining project to gather hands-on experience.
The course learning objectives include
- Learning basic data mining algorithms and their applications
- Learning about the tools and technologies available for analyzing various types of data
- Gaining hands-on experience in cleaning, managing and processing complex data.
Book: Data Mining: Concepts and Techniques, 3rd ed.
We will occasionally refer to this book
by Charu Aggarwal. The book is freely available to download on the campus network.
Lecture Schedule: Here
Grading: There will be a final exam worth 35% of the grade. Students will pick datasets for projects and apply mining algorithms. The project is worth 40%. There will be three to five homework assignments, together worth 20% of the grade. Homework will focus on understanding the algorithms and techniques. The remaining 5% will be based on class participation and attendance.
Academic Calendar: For a list of dates to enroll in, change, or withdraw from classes and a list of holidays, go here.
Project:
Each student will do one project. A project consists of four phases with equal
weights.
1. Classification: Perform classification on the chosen dataset and produce cross-validated precision/recall numbers.
Due: Oct 3, 2016. Send a report to the class email address. Use plots and charts to describe your project. Write the report assuming you would submit it for publication in a journal.
Requirements:
- Formulate the classification problem. Describe it mathematically. What is the input, and what is the desired output? Do you need any preprocessing?
- At least two classifiers must be tested. You must justify your choice in the report.
- Compare the two classifiers based on Precision, Recall and F-measure. Use leave-one-out cross-validation. If it is too time-consuming, perform 10-fold cross-validation. What is the default classification accuracy?
- Describe the choices of parameters. Test ranges of values for parameters and choose the best one.
- Discuss your results. Why is one method better than the other? Have you been able to beat the baseline? Why?
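The evaluation asked for in these requirements can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a required implementation: the 1-nearest-neighbor classifier is only a stand-in for whichever classifiers you choose, the toy data is made up, and "default classification accuracy" is assumed to mean the accuracy of always predicting the most frequent class. All function names are hypothetical.

```python
from collections import Counter

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F-measure with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def k_fold_indices(n, k):
    """Split range(n) into k folds by striding; k == n gives leave-one-out."""
    return [list(range(i, n, k)) for i in range(k)]

def one_nn_predict(train_X, train_y, x):
    """Stand-in classifier: label of the nearest training point."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

def cross_validated_predictions(X, y, k):
    """Predict each sample with a model trained on the other folds."""
    preds = [None] * len(X)
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        for i in fold:
            preds[i] = one_nn_predict(train_X, train_y, X[i])
    return preds

def default_accuracy(y):
    """Assumed baseline: accuracy of always predicting the majority class."""
    return Counter(y).most_common(1)[0][1] / len(y)

# Toy data, made up for illustration only.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1],
     [0.9, 1.0], [1.2, 0.9], [0.15, 0.2], [1.1, 1.2]]
y = ["a", "a", "a", "b", "b", "b", "a", "b"]

preds = cross_validated_predictions(X, y, k=len(X))  # leave-one-out
precision, recall, f1 = precision_recall_f1(y, preds, positive="a")
accuracy = sum(t == p for t, p in zip(y, preds)) / len(y)
baseline = default_accuracy(y)
print(precision, recall, f1, accuracy, baseline)
```

Setting k to the number of samples gives leave-one-out; k=10 gives 10-fold cross-validation. On real data, shuffle or stratify the samples before forming folds, since striding assumes the sample order carries no structure.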
2. Clustering: Perform clustering on the chosen dataset and produce meaningful clusters.
Due: Oct 30, 2016. Send a report to the class email address. Use plots and charts to describe your project. Write the report assuming you would submit it for publication in a journal.
Requirements:
- Formulate the clustering problem. Describe it mathematically. What is the input, and what is the desired output? Do you need any preprocessing?
- At least two clustering algorithms must be tested. You must justify your choice in the report. Discuss your distance function.
- Compare the two clustering algorithms based on F-measure and NMI.
- Describe the choices of parameters. Test ranges of values for parameters and choose the best one. Discuss the choices you made.
- Discuss your results. Why is one method better than the other?
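Both comparison measures named in these requirements can be computed directly from two label lists. The sketch below is illustrative only and makes two assumptions worth stating in your report: NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist), and the clustering F-measure is the pair-counting variant (other definitions exist). Function names and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(truth, clusters):
    """Mutual information divided by the mean of the two entropies."""
    n = len(truth)
    joint = Counter(zip(truth, clusters))
    t_counts, c_counts = Counter(truth), Counter(clusters)
    mi = sum((nij / n) * log(nij * n / (t_counts[t] * c_counts[c]))
             for (t, c), nij in joint.items())
    denom = (entropy(truth) + entropy(clusters)) / 2
    return mi / denom if denom > 0 else 1.0

def pairwise_f_measure(truth, clusters):
    """F-measure over pairs: a pair is positive if both points share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_truth:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_truth:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy labels, made up for illustration only.
truth = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]   # same grouping, cluster ids swapped
noisy = [0, 0, 1, 1, 1, 1]     # one point assigned to the wrong cluster

for name, found in (("perfect", perfect), ("noisy", noisy)):
    print(name, nmi(truth, found), pairwise_f_measure(truth, found))
```

Both measures ignore how the cluster ids are named, so the relabeled "perfect" clustering scores the same as an identical labeling would.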
3. Outlier Detection: Run outlier detection algorithms on the given dataset and identify anomalous behavior.
Due: November 20, 2016. Send a report to the class email address. Use plots and charts to describe your project. Write the report assuming you would submit it for publication in a journal.
Requirements:
- Formulate the outlier detection problem. Describe it mathematically. What is the input, and what is the desired output? Do you need any preprocessing?
- At least two outlier detection algorithms must be tested. You must justify your choice in the report. Discuss your distance function. You may want to choose two different types of algorithms (proximity-based, clustering-based, statistical, etc.)
- Investigate the outliers and discuss why they are outliers.
- Describe the choices of parameters. Test ranges of values for parameters and choose the best one. Discuss the choices you made.
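As one illustration of testing two different types of algorithms, the sketch below scores the same points with a proximity-based method (distance to the k-th nearest neighbor) and a statistical method (largest per-feature z-score). These two methods, the Euclidean distance, the function names, and the toy points are all assumptions chosen for brevity; they are not the methods you are required to use.

```python
from math import sqrt
from statistics import mean, pstdev

def knn_distance_scores(points, k):
    """Proximity-based score: Euclidean distance to the k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

def z_score_scores(points):
    """Statistical score: largest absolute z-score across the features."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [
        max(abs(x - mu) / sd if sd > 0 else 0.0
            for x, mu, sd in zip(p, mus, sigmas))
        for p in points
    ]

def top_outliers(scores, n):
    """Indices of the n highest-scoring points, most anomalous first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

# Toy points, made up for illustration only; the last one is far from the rest.
points = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
          [1.2, 1.1], [1.0, 0.8], [8.0, 9.0]]

# Sweep the neighborhood size to see how sensitive the ranking is to k.
for k in (1, 2, 3):
    print("k =", k, "top outlier:", top_outliers(knn_distance_scores(points, k), 1))
print("z-score top outlier:", top_outliers(z_score_scores(points), 1))
```

When the two methods disagree on a point, that point is a good candidate for the "why is it an outlier" discussion.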
4. Ensembling: Apply an ensembling technique to improve the accuracy of any of the above tasks.
Due: December 15, 2016. Send a report to the class email address. Use plots and charts to describe your project. Write the report assuming you would submit it for publication in a journal.
Requirements:
- Implement an additional classifier for your problem.
- Ensemble the three classifiers (the two previously built in part 1 + the new one) using majority voting. If you have ties, break them with some strategy and mention it in the report.
- Report the k-fold accuracy of the ensemble and discuss whether the ensemble is better than the individual classifiers. Sometimes, mean accuracy remains the same while variance among the folds reduces significantly.
- Combine this report with the previous three reports to produce one complete report for the class.
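Majority voting over three classifiers, with an explicit tie-breaking rule and per-fold accuracy, can be sketched as follows. This is a hedged illustration: the tie-breaking strategy shown (prefer the label that comes first in a fixed priority list) is just one possible choice, and the predictions, fold assignment, and function names are made up.

```python
from collections import Counter
from statistics import mean, pvariance

def majority_vote(votes, priority):
    """Most common label; ties go to the label listed first in priority."""
    counts = Counter(votes)
    best = max(counts.values())
    tied = [label for label in counts if counts[label] == best]
    return min(tied, key=priority.index)

def ensemble_predictions(per_classifier_preds, priority):
    """Combine per-classifier prediction lists sample by sample."""
    return [majority_vote(votes, priority) for votes in zip(*per_classifier_preds)]

def fold_accuracies(y_true, y_pred, folds):
    """Accuracy computed separately on each fold of sample indices."""
    return [sum(y_true[i] == y_pred[i] for i in fold) / len(fold) for fold in folds]

# Made-up predictions for six samples and three classes.
y_true = ["a", "b", "c", "a", "b", "c"]
clf1 = ["a", "b", "c", "b", "b", "a"]
clf2 = ["a", "c", "c", "a", "b", "b"]
clf3 = ["b", "b", "a", "a", "c", "c"]
priority = ["a", "b", "c"]          # hypothetical tie-breaking order
folds = [[0, 1], [2, 3], [4, 5]]    # hypothetical 3-fold split

ensemble = ensemble_predictions([clf1, clf2, clf3], priority)
for name, preds in (("clf1", clf1), ("clf2", clf2), ("clf3", clf3), ("ensemble", ensemble)):
    accs = fold_accuracies(y_true, preds, folds)
    print(name, "mean:", mean(accs), "variance:", pvariance(accs))
```

With three voters a tie can only occur when all three disagree, which requires at least three classes; the last sample above is such a case. Reporting both the mean and the variance of the fold accuracies lets you show when the ensemble is more stable even if its mean accuracy is unchanged.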
In each phase, a student produces a report describing data cleaning, method(s), results, and discussions. Phase-specific goals will be announced on the class page. A student will merge the four small reports into a final report and submit it during finals week.
The datasets are
Homework:
No late submissions will be accepted. There will be no make-up exams except for university-excused absences. Please discuss unusual circumstances in advance with the instructor.
Homework 1: Click here. Due: Monday 9/26. Submit in the class.
Homework 2: Click here. Due: Monday 10/26. Submit in the class.
Homework 3: Click here. Due: Monday 11/14. Submit in the class.
Homework 4:
Tools:
Slides:
No form of discrimination, sexual harassment, or sexual misconduct will be
tolerated in this class or at UNM in general. I strongly encourage you to
report any problems you have in this regard to the appropriate person at UNM.
As described below, I must report any such incidents of which I become aware to
the university. UNM also has confidential counselors available through UNM Student
Health and Counseling (SHAC), UNM Counseling and Referral Services (CARS), and
UNM LoboRespect.
UNM faculty, Teaching Assistants, and Graduate Assistants are
considered "responsible employees" by the Department of Education (see
pg 15 - http://www2.ed.gov/about/offices/list/ocr/docs/qa-201404-title-ix.pdf).
This designation requires that any report of gender discrimination which
includes sexual harassment, sexual misconduct and sexual violence made to a
faculty member, TA, or GA must be reported to the Title IX Coordinator at the
Office of Equal Opportunity (oeo.unm.edu).
More complete information on the UNM policy regarding sexual misconduct,
including reporting, counseling, and legal options, is available online: https://policy.unm.edu/university-policies/2000/2740.html