# EOSC 510 · Data Analysis in Atmospheric, Earth and Ocean Sciences

**Instructor**: Valentina Radic

**Course Description**

This is a course for graduate-level students in the Sciences, in which students gain quantitative skills for tackling a wide range of problems in data analysis and empirical modeling. Although the skills learned in the course are applicable to any field of the natural sciences, the course examples are drawn mainly from Earth, Ocean and Atmospheric Sciences. The course equips students with methods and techniques applicable to a broad range of research problems involving field, experimental, observational and modeled data. The emphasis is on practical applications of data analysis and machine learning methods to actual datasets, enabling students to answer a large set of research questions involving 'big' data.

Examples of research problems and methods for tackling them:

**Questions**: What is the relationship among multiple variables in a given system or phenomenon? Which variables are the dominant drivers of a given phenomenon?

**Methods**: Linear regression, multiple linear regression, stepwise regression

**Questions**: What are the most significant modes (behaviors) in a system and how are they inter-related? How to 'compress' a large dataset without losing its essential information, i.e. how to meaningfully reduce the degrees of freedom in a system?

**Methods**: Principal component analysis and canonical correlation analysis

**Questions**: How to decompose a noisy signal in order to find any signals of interest? How to effectively analyze a time series?

**Methods**: Fourier spectral analysis, filters, and singular spectrum analysis

**Questions**: What are the most characteristic features (temporal or spatial patterns) in a given large dataset? How to split a large dataset into 'meaningful' clusters/groups?

**Methods**: Classification and clustering (e.g. Self-Organizing Maps, hierarchical clustering)

**Questions**: How to derive a model just by using the data, i.e. without any a priori knowledge of the physical processes in the system? How to test the performance of such a model? How to correctly calibrate, validate and test empirical models?

**Methods**: Feed-forward neural network models, machine learning/training, optimization and generalization

**Course Outline**

EOSC510 is taught through lectures and labs, and through online material (online videos and texts). The online material covers the mathematical derivations and theoretical development of a given data analysis/machine learning method. The lectures demonstrate in detail the application of that method to a variety of datasets from Earth, Ocean and Atmospheric Sciences. The main goal of the lectures is to develop a conceptual and practical understanding of the methods by demonstrating how they work in practice, i.e. when applied to synthetic and real datasets. During the lectures, the instructor will explain the objectives of a given method, describe the dataset, walk the students through a set of MATLAB scripts that implement the method, present results, and lead an interactive class discussion on the advantages and limitations of the method.

The labs are designed as 'workshops' in which students program in MATLAB to solve a given set of data analysis problems (using methods covered in the lectures). During the labs, students can work individually or in pairs, and the instructor is there to assist them (one-on-one or in groups) and to provide any guidance needed for the given tasks. For each lab, a project/exercise description will be given, introducing a dataset and outlining a set of questions for students to answer. Instructions and guidelines will also be provided.

Each week, students are expected to view/read the online material (ca. 1.5 hours per week) and come prepared for the weekly lecture (1.5 hours) on Tuesday. Labs on Thursdays consist of 'hands-on' computer exercises (1.5 hours) as a follow-up to the material covered in the lectures. Students need to bring their own laptops to the labs.

**Topic outline by week:**

**• Week 1 (11-17 Sep)**

*Online lectures/readings before the class (Tue)*: Chapter 1. Mean and variance, Correlation, Linear regression, Multiple linear regression, MATLAB programming (Ch1.pdf, Ch1_Q_solns.pdf - PDF file containing solutions to questions posed in the videos)

Recommended: useful material (from the EOSC250 course) if you need a recap on calculus and linear algebra

*Lecture (Tue)*: Introduction to the course, Multiple linear regression and stepwise regression in MATLAB (example on synthetic and real data); Class presentation and MATLAB scripts

*Lab (Thu)*: Intro to MATLAB (Lab1_files), Multiple linear regression and stepwise regression (Lab2_files, Lab2_solutions)

**Quiz 1** (to be handed in or emailed to instructor on Thu, 14th Sep)

**• Week 2 (18-24 Sep)**

*Online lectures/readings before the class (Tue)*: Chapter 2. Principal component analysis (PCA) and rotated PCA: Geometric approach, Eigenvector approach, Complex data; orthogonality (Ch2a.pdf, Ch2_Q_solns.pdf)

*Lecture (Tue)*: PCA in MATLAB (example on synthetic data); Class presentation and MATLAB scripts

*Lab (Thu)*: PCA (Lab3_files, Lab3_solutions)

**Quiz 2** (to be handed in or emailed to instructor on Thu, 21st Sep)

**ASSIGNMENT 1** (to be emailed to TA by 11:59 pm on Mon, 2nd Oct)

**• Week 3 (25 Sep - 1 Oct)**

*Online lectures/readings before the class (Tue)*: Chapter 2. PCA applied on real data, Scaling; degeneracy, Smaller covariance matrix; mean removal, Singular value decomposition, Missing data; significance tests (Ch2b.pdf)

*Lecture (Tue)*: PCA in MATLAB (example on real data); Class presentation and MATLAB scripts - big file (99 MB)!

*Lab (Thu)*: PCA (Lab4_files, Lab4_solutions)

**Quiz 3** (to be handed in or emailed to instructor on Thu, 28th Sep)

**• Week 4 (2-8 Oct)**

*Online lectures/readings before the class (Tue)*: Chapter 2. Rotated PCA, Varimax; teleconnection patterns, PCA versus Rotated PCA, (Optional: PCA for vectors); Chapter 3. Canonical correlation analysis (CCA), CCA theory (part 1), CCA theory (part 2), Pre-filter by PCA, Maximum covariance analysis (Ch2c.pdf, Ch3.pdf, Ch3_Q_solns.pdf)

*Lecture (Tue)*: Rotated PCA and CCA in MATLAB (synthetic and real data); Class presentation and MATLAB scripts - big file (35 MB)!

*Lab (Thu)*: Rotated PCA and CCA (Lab5_files, Lab5_solutions)

**Quiz 4** (to be handed in or emailed to instructor on Thu, 5th Oct)

**ASSIGNMENT 2** (to be emailed to TA by 11:59 pm on Mon, 23rd Oct)

**• Week 5 (9-15 Oct)**

*Online lectures/readings before the class (Tue)*: Chapter 4. Time series, Fourier spectral analysis: autospectrum, Autospectrum (part 1), Autospectrum (part 2), Cross-spectrum (Ch4a.pdf, Ch4_Q_solns.pdf)

*Lecture (Tue)*: FSA on synthetic data in MATLAB; Class presentation and MATLAB scripts

*Lab (Thu)*: FSA on real data (Lab6_files, Lab6_solutions)

**Quiz 5** (to be handed in or emailed to instructor on Thu, 12th Oct)

**• Week 6 (16-22 Oct)**

*Online lectures/readings before the class (Tue)*: Chapter 4. Windows, Filters (part 1), Filters (part 2), Singular spectrum analysis, Multichannel singular spectrum analysis (Ch4b.pdf, Ch4c.pdf)

*Lecture (Tue)*: Filtering and SSA in MATLAB; Class presentation and MATLAB scripts

*Lab (Thu)*: Filtering and SSA (Lab7_files, Lab7_1Dfiltering_solutions, Lab7_2Dfiltering_solutions)

**Quiz 6** (to be handed in or emailed to instructor on Thu, 19th Oct)

**• Week 7 (23-29 Oct)**

*Online lectures/readings before the class (Tue)*: Chapter 5. Classification and clustering, Classification: k-nearest neighbour classifier, Conditional probabilities, Bayes' theorem, Logistic regression, Clustering: k-means clustering, Hierarchical clustering (Ch5a.pdf, Ch5_Q_solns.pdf, Ch5b.pdf)

*Lecture (Tue)*: Clustering and intro to Self-organizing maps (SOMs); Class presentation and MATLAB scripts

*Lab (Thu)*: Clustering (Lab8_files, Lab8_solutions)

**Quiz 7** (to be handed in or emailed to instructor on Thu, 26th Oct)

**ASSIGNMENT 3** (spec.mat, spec.txt) (to be emailed to TA by 11:59 pm on Mon, 13th Nov)

**• Week 8 (30 Oct - 5 Nov)**

*Online lectures/readings before the class (Tue)*: Chapter 5. Self-organizing maps, Chapter 6. Feed-forward neural network models: McCulloch and Pitts model, Perceptrons, Limitations of perceptrons (Ch5c.pdf, Ch6a.pdf, Ch6_Q_solns.pdf)

*Lecture (Tue)*: Application of Self-organizing maps (SOMs) in MATLAB; Class presentation and MATLAB scripts, somtoolbox

*Lab (Thu)*: SOMs (Lab9_files, Lab9_solutions)

**Quiz 8** (to be handed in or emailed to instructor on Thu, 2nd Nov)

**Guidelines for project presentation**

**• Week 9 (6-12 Nov)**

*Online lectures/readings before the class (Tue)*: Chapter 6. Multi-layer perceptrons (MLP) - part 1, MLP - part 2, MLP - part 3, Back-propagation, Hidden neurons, MLP classifier (Ch6b.pdf)

*Lecture (Tue)*: Neural Network modelling (example on synthetic data); Class presentation and MATLAB scripts

*Lab (Thu)*: Neural Network modelling (Lab10_files, Lab10_solutions)

**Quiz 9** (to be handed in or emailed to instructor on Thu, 9th Nov)

**• Week 10 (13-19 Nov)**

*Online lectures/readings before the class (Tue)*: Chapter 7. Nonlinear optimization, Gradient descent methods, Chapter 8. Learning and generalization: Mean squared error and maximum likelihood, Objective functions and robustness, Variance and bias errors, Regularization (Ch7.pdf, Ch8a.pdf, Ch7_Q_solns.pdf, Ch8_Q_solns.pdf)

*Lecture (Tue)*: Neural Network modelling (example on real data); Class presentation and MATLAB scripts

*Lab (Thu)*: Neural Network modelling, Work on students' projects (Lab11_files)

**Quiz 10** (to be handed in or emailed to instructor on Thu, 16th Nov)

**ASSIGNMENT 4** (data file) (to be emailed to TA by 11:59 pm on Mon, 4th Dec)

**• Week 11 (20-26 Nov)**

*Online lectures/readings before the class (Tue)*: Chapter 8. Cross-validation, Bayesian neural networks, Errors of ensembles, Nonlinear ensemble averaging; boosting, Linearization from time-averaging, Regularization of linear models (Ch8b.pdf)

*Lab (Tue)*: Work on students' projects

*Presentations (Thu)*: Students' project presentations (part 1)

**Grading rubric for presentations**

**• Week 12 (27 Nov - 3 Dec)**

*Presentations (Tue)*: Students' project presentations (part 2)

*Presentations (Thu)*: Students' project presentations (part 3)

**Guidelines and grading rubric for reports**

**Main textbook:** Hsieh, W.W., 2009. Machine Learning Methods in the Environmental Sciences. Cambridge Univ. Pr., 349 pp.

**Online lectures and readings**: *selected from William Hsieh's online lectures developed for the course EOSC510 (Data Analysis in Atmospheric, Earth and Ocean Sciences). Original lectures from EOSC510 are posted at: http://www.ocgy.ubc.ca/~william/EOSC510*

**Evaluation**

Students will be evaluated based on the following expectations and timeline. By the mid-term, students need to develop an idea about the data they want to analyze and the method(s) they want to use, and then discuss it with the instructor during office hours and/or during the labs. The instructor will provide feedback and guidance to each student directly. The expectations (and evaluation rubric) for the presentations and final reports will be explained in detail during one of the lectures (week 7 or 8). After the presentations, where students present the methodology and some preliminary results of their data projects, the instructor will provide feedback and, if needed, additional guidance (by email and/or in person). The final report is to be submitted 3 weeks after the presentations, which allows enough time for students to incorporate the feedback and finalize their reports. Depending on enrolment (e.g. if more than 15 students are enrolled), students would be asked to work on the projects in pairs rather than individually.

**Proposed breakdown of assessments:**

**Quizzes (10%):**

A quiz will be given every week as homework. The quizzes, each consisting of multiple questions, are designed to help students recap the online lectures/readings and improve their understanding of the new material. There are 10 quizzes in total, each worth 1% of the total grade.

**Assignments (40%):**

Throughout the course there are four assignments, designed as short data analysis/modeling projects. Students are asked to apply a newly learned method/technique to a given dataset and provide a brief discussion of the results (guidelines for the discussion are provided in each assignment). Each assignment contributes 10% to the total grade.

**Presentation on data projects (15%):**

Towards the end of the course, students are expected to find a dataset they want to analyze/model with the tool(s) learned in the course. During the last week of the course, students present (as a 5-7 min talk) the preliminary results of their individual data analysis projects.

**Data project report (35%):**

The final outcomes of the data projects are to be synthesized in a written report, to be handed in by the end of the exam period.
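
As a rough illustration of how the assessment weights listed above combine into a total grade, here is a minimal Python sketch. The function name `final_grade` and the example scores are hypothetical. Equal weighting within the quizzes and within the assignments is an assumption inferred from the stated per-item percentages (1% per quiz, 10% per assignment), and this is not an official grading tool.

```python
# Assessment weights as stated in the breakdown above (fractions of the total grade).
WEIGHTS = {
    "quizzes": 0.10,       # 10 quizzes, 1% each
    "assignments": 0.40,   # 4 assignments, 10% each
    "presentation": 0.15,
    "report": 0.35,
}


def final_grade(quizzes, assignments, presentation, report):
    """Return the weighted total (0-100) from component scores given in percent.

    Assumption: each quiz counts equally and each assignment counts equally.
    """
    if len(quizzes) != 10 or len(assignments) != 4:
        raise ValueError("expected 10 quiz scores and 4 assignment scores")
    quiz_avg = sum(quizzes) / len(quizzes)
    assignment_avg = sum(assignments) / len(assignments)
    return (WEIGHTS["quizzes"] * quiz_avg
            + WEIGHTS["assignments"] * assignment_avg
            + WEIGHTS["presentation"] * presentation
            + WEIGHTS["report"] * report)


# Hypothetical scores, for illustration only.
example = final_grade(
    quizzes=[100] * 10,
    assignments=[80, 90, 70, 80],
    presentation=90,
    report=80,
)
print(round(example, 1))  # 83.5
```

With these made-up scores the components contribute 10 + 32 + 13.5 + 28 = 83.5 points.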