EOSC 510 · Data Analysis in Atmospheric, Earth and Ocean Sciences
Instructor: Valentina Radic
This is a course for graduate-level students in the Sciences, in which students will gain quantitative skills for tackling a large range of problems in data analysis and empirical modeling. Although the skills learned in the course are applicable to any field of natural sciences, the course examples are mainly drawn from Earth, Ocean and Atmospheric Sciences. The course will equip students with methods and techniques applicable to a broad range of research problems involving field, experimental, observational and modeled data. The emphasis is on practical applications of data analysis and machine learning methods on actual datasets in order to enable the students to answer a large set of research questions involving 'big' data.
Examples of research problems and the methods used to tackle them:
Questions: What is the relationship among multiple variables in a given system or phenomenon? Which variables are the dominant drivers of a given phenomenon?
Methods: Linear regression, multiple linear regression, stepwise regression
Questions: What are the most significant modes (behaviors) in a system and how are they inter-related? How can a large dataset be 'compressed' without losing its essential information, i.e. how can the degrees of freedom in a system be meaningfully reduced?
Methods: Principal component analysis and canonical correlation analysis
Questions: How to decompose a noisy signal in order to find any signals of interest? How to effectively analyze a time series?
Methods: Fourier spectral analysis, filters, and singular spectrum analysis
Questions: What are the most characteristic features (temporal or spatial patterns) in a given large dataset? How to split a large dataset into ‘meaningful’ clusters/groups?
Methods: Classification and clustering (e.g. Self-Organizing Maps, hierarchical clustering)
Questions: How to derive a model just by using the data, i.e. without any a priori knowledge of the physical processes in the system? How to test the performance of such a model? How to correctly calibrate, validate and test empirical models?
Methods: Feed-forward neural network models, machine learning/training, optimization and generalization
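As a small illustration of one of the questions above (finding a signal of interest in a noisy time series), the following is a minimal sketch of a Fourier periodogram. It is written in Python with only the standard library purely for illustration; the course itself works in MATLAB, and this is not taken from the course scripts. The function names, the synthetic series and its parameters (64 samples, 5 cycles, noise level 0.5) are all assumptions made for this example.

```python
import cmath
import math
import random


def periodogram(x):
    """Return the DFT power |X_k|^2 for k = 0 .. N/2 of a real series x."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]  # remove the mean so k = 0 does not dominate
    power = []
    for k in range(n // 2 + 1):
        # Direct O(N^2) evaluation of the discrete Fourier transform.
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power.append(abs(coeff) ** 2)
    return power


def dominant_frequency_index(x):
    """Index k >= 1 (cycles per record length) with the largest power."""
    power = periodogram(x)
    return max(range(1, len(power)), key=lambda k: power[k])


# Hypothetical synthetic series: 5 cycles over 64 samples plus Gaussian noise.
rng = random.Random(0)  # fixed seed so the example is reproducible
n_samples = 64
series = [math.sin(2 * math.pi * 5 * t / n_samples) + rng.gauss(0.0, 0.5)
          for t in range(n_samples)]

peak = dominant_frequency_index(series)
print(f"dominant frequency: {peak} cycles per record")
```

The direct summation is chosen for readability rather than speed; on real datasets a fast Fourier transform routine would normally be used instead.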
EOSC510 is taught through lectures and labs, and through online material (online videos and texts). The online material covers the mathematical derivations and theoretical development of a given data analysis/machine learning method. The lectures demonstrate in detail how a given data analysis/machine learning method is applied to a variety of datasets from Earth, Ocean and Atmospheric Sciences. The main goal of the lectures is to develop a conceptual and practical understanding of the methods by demonstrating how they work in practice, i.e. when applied to synthetic and real datasets. During the lectures, the instructor will explain the objectives of a given method, describe the dataset, walk the students through a set of MATLAB scripts that implement the method, present results and lead an interactive class discussion on the advantages and limitations of the method. The labs are designed as 'workshops' in which students program in MATLAB to solve a given set of data analysis problems (using methods covered in the lectures). During the labs, students can work individually or in pairs, and the instructor is there to assist them (one-on-one or in groups) and to provide any guidance needed for the given tasks. For each lab, a project/exercise description will be given, introducing a dataset and outlining a set of questions for students to answer. Instructions with some guidelines will also be provided.
Each week students are expected to view/read the online material (ca. 1.5 hours per week) and come prepared to the weekly lecture (1.5 hours) on Tuesday. Labs on Thursdays will consist of 'hands-on' computer exercises (1.5 hours) as a follow-up to the material covered in the lectures. Students need to bring their own laptops to the labs.
Topic outline by week:
• Week 1 (11-17 Sep)
Online lectures/readings before the class (Tue): Chapter 1. Mean and variance, Correlation, Linear regression, Multiple linear regression, MATLAB programming (Ch1.pdf, Ch1_Q_solns.pdf - PDF file containing solutions to questions posed in the videos)
Recommended: useful material (from the EOSC250 course) if you need a recap on calculus and linear algebra
Lecture (Tue): Introduction to the course, Multiple linear regression and stepwise regression in MATLAB (example on synthetic and real data); Class presentation and MATLAB scripts
Lab (Thu): Intro to MATLAB (Lab1_files), Multiple linear regression and stepwise regression (Lab2_files, Lab2_solutions)
Quiz 1 (to be handed in or emailed to instructor on Thu, 14th Sep)
• Week 2 (18-24 Sep)
Online lectures/readings before the class (Tue): Chapter 2. Principal component analysis (PCA) and rotated PCA: Geometric approach, Eigenvector approach, Complex data; orthogonality (Ch2a.pdf, Ch2_Q_solns.pdf)
Lecture (Tue): PCA in MATLAB (example on synthetic data); Class presentation and MATLAB scripts
Lab (Thu): PCA (Lab3_files, Lab3_solutions)
Quiz 2 (to be handed in or emailed to instructor on Thu, 21st Sep)
ASSIGNMENT 1 (to be emailed to TA by Mon, 2nd Oct)
• Week 3 (25 Sep - 1 Oct)
Online lectures/readings before the class (Tue): Chapter 2. PCA applied on real data, Scaling; degeneracy, Smaller covariance matrix; mean removal, Singular value decomposition, Missing data; significance tests (Ch2b.pdf)
Lecture (Tue): PCA in MATLAB (example on real data)
Lab (Thu): PCA
Quiz 3 (to be handed in or emailed to instructor on Thu, 28th Sep)
• Week 4 (2-8 Oct)
Online lectures/readings before the class (Tue): Chapter 2. Rotated PCA, Varimax; teleconnection patterns, PCA versus Rotated PCA, Chapter 3. Canonical correlation analysis (CCA), CCA theory, Pre-filter by PCA, Maximum covariance analysis
Lecture (Tue): Rotated PCA and CCA in MATLAB (synthetic and real data)
Lab (Thu): Rotated PCA and CCA
• Week 5 (9-15 Oct)
Online lectures/readings before the class (Tue): Chapter 4. Time series, Fourier spectral analysis: autospectrum, Cross-spectrum
Lecture (Tue): FSA on synthetic data in MATLAB
Lab (Thu): FSA on real data
• Week 6 (16-22 Oct)
Online lectures/readings before the class (Tue): Chapter 4. Windows, Filters, Singular spectrum analysis
Lecture (Tue): moving average and SSA
Lab (Thu): moving average and SSA
• Week 7 (23-29 Oct)
Online lectures/readings before the class (Tue): Chapter 5. Classification and clustering, Classification: k-nearest neighbour classifier, Conditional probabilities, Bayes' theorem, Logistic regression, Clustering: k-means clustering, Hierarchical clustering
Lecture (Tue): Classification and Clustering
Lab (Thu): Clustering
• Week 8 (30 Oct - 5 Nov)
Online lectures/readings before the class (Tue): Chapter 5. Self-organizing maps, Chapter 6. Feed-forward neural network models, McCulloch and Pitts model, Perceptrons, Limitations of perceptrons
Lecture (Tue): Self-organizing maps (SOMs)
Lab (Thu): SOMs
• Week 9 (6-12 Nov)
Online lectures/readings before the class (Tue): Chapter 6. Multi-layer perceptrons, Back-propagation, Hidden neurons, MLP classifier, Chapter 7. Nonlinear optimization, Gradient descent methods
Lecture (Tue): Neural Network modelling (example on synthetic data)
Lab (Thu): Neural Network modelling
• Week 10 (13-19 Nov)
Online lectures/readings before the class (Tue): Chapter 8. Learning and generalization, Mean squared error and maximum likelihood, Objective functions and robustness, Variance and bias errors, Regularization, Cross-validation
Lecture (Tue): Neural Network modelling (example on real data)
Lab (Thu): Neural Network modelling, Work on students' projects
• Week 11 (20-26 Nov)
Online lectures/readings before the class (Tue): Chapter 8. Bayesian neural networks, Errors of ensembles, Nonlinear ensemble averaging; boosting, Linearization from time-averaging
Lab (Tue): Work on students' projects
Presentations (Thu): Students' project presentations (part 1)
• Week 12 (27 Nov - 3 Dec)
Presentations (Tue): Students' project presentations (part 2)
Presentations (Thu): Students' project presentations (part 3)
Main textbook: Hsieh, W.W., 2009. Machine Learning Methods in the Environmental Sciences. Cambridge Univ. Pr., 349 pp.
Online lectures and readings: selected from William Hsieh's online lectures developed for the course EOSC510 (Data Analysis in Atmospheric, Earth and Ocean Sciences). Original lectures from EOSC510 are posted at: http://www.ocgy.ubc.ca/~william/EOSC510
The students will be evaluated based on the following expectations and timeline: by the mid-term, students will need to develop an idea about the data they want to analyze and the method(s) they want to use, and then talk to the instructor about this during office hours and/or during the labs. The instructor will provide feedback and guidance to each student directly. The expectations (and evaluation rubric) for the presentations and final reports will be explained in detail during one of the lectures (week 7 or 8). After the presentations, where students present the methodology and some preliminary results of their data projects, the instructor will provide feedback and, if needed, any additional guidance (by email and/or in person). The final report is to be submitted 3 weeks after the presentations. This will allow enough time for the students to incorporate the feedback and finalize the reports. Depending on enrolment in the course (e.g. if more than 15 students are enrolled), students may be asked to work on the projects in pairs rather than individually.
Proposed breakdown of assessments:
Quizzes (10%):
A quiz will be given every week for homework. The quizzes, each consisting of multiple questions, are designed to help students recap the online lectures/readings and improve their understanding of the new material. There are 10 quizzes in total, each worth 1% of the total grade.
Assignments (40%):
Throughout the course there are four assignments, designed as short data analysis/modeling projects. Students are asked to apply a newly learned method/technique to a given dataset and provide a brief discussion of the results (guidelines for the discussions are provided in each assignment). Each assignment contributes 10% to the total grade.
Presentation on data projects (20%):
Towards the end of the course students are expected to find a dataset they want to analyze/model with the tool(s) learned from the course. During the last week of the course students present (as a 5-7 min talk) preliminary results of their individual data analysis projects.
Data project report (30%):
The final outcomes of the data projects are to be synthesized in a written report, to be handed in by the end of the exam period.