Variable selection and anomaly detection using PCA

Dr Insha Ullah is a postdoctoral researcher in the Research School of Finance, Actuarial Studies and Statistics at ANU, and he gave the RSFAS seminar on Thursday 27 October. About a dozen people attended in person, with another dozen online.

Insha introduced two related pieces of work that showcased the use of principal component analysis (PCA). The first was around variable selection, particularly in situations where the number of variables, p, far exceeds the number of cases, n. It turns out that LASSO regression is not appropriate when n is small, so another approach is needed. Smyth’s moderated t-statistics, introduced in his 2004 paper, have received over 13,000 citations to date and have clearly proven extremely useful in a variety of situations, but they also were not the route Insha wished to pursue.

Insha was able to show through a simulation study that the PCA variable selection method he outlined did very well for 300 variables and between 10 and 100 cases. He also showed a real-world example with 22,800 genes as the variables and 190 cases.
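For readers who want a feel for the general idea, here is a minimal sketch of one generic way PCA loadings can be used to rank variables when p far exceeds n. To be clear, this is my own simplified illustration with synthetic data and an arbitrary loading-based score, not Insha’s actual method (which is set out in the paper):

```python
# A generic illustration of PCA-based variable ranking when p >> n.
# NOT the method from the seminar -- a simplified sketch only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 300                                # few cases, many variables
X = rng.normal(size=(n, p))
X[:, :10] += 2.0 * rng.normal(size=(n, 1))    # plant 10 correlated "signal" variables

Xc = X - X.mean(axis=0)                       # centre the columns
# The SVD yields the principal components without forming the p x p covariance
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                                         # leading components to use
# Score each variable by its variance-weighted squared loadings
scores = (Vt[:k].T ** 2) @ (s[:k] ** 2)
top = np.argsort(scores)[::-1][:10]           # pick the 10 highest-scoring variables
print(sorted(top))                            # should recover most of variables 0-9
```

Because the SVD is computed on the n × p data matrix directly, the cost scales with n rather than p, which is what makes the p >> n setting feasible.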

In the second case study, Insha showed how to use PCA for anomaly detection in computer networks. This time the supporting simulation study involved a cool 10,000 cases to train the algorithm and 500 to test (with 100 contaminated cases planted in the test data). The real-world example was from a publicly available anomaly detection dataset called KDD’99, involving 5 million cases, of which around a quarter were anomalous.
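The standard reconstruction-error flavour of PCA anomaly detection gives a feel for how such a scheme can work. The sketch below is a generic illustration under my own assumptions (synthetic data, five retained components, an arbitrary 99th-percentile threshold), not necessarily the algorithm Insha presented:

```python
# Generic PCA anomaly detection via reconstruction error -- an
# illustrative sketch, not the exact algorithm from the seminar.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(10_000, 20))        # clean training cases
test = rng.normal(size=(500, 20))
test[:100] += 4.0                            # plant 100 contaminated cases

mu = train.mean(axis=0)
U, s, Vt = np.linalg.svd(train - mu, full_matrices=False)
V = Vt[:5].T                                 # keep the top 5 principal directions

def reconstruction_error(X):
    Z = (X - mu) @ V                         # project onto the components
    Xhat = Z @ V.T + mu                      # map back to the original space
    return ((X - Xhat) ** 2).sum(axis=1)     # squared residual per case

threshold = np.quantile(reconstruction_error(train), 0.99)
flagged = reconstruction_error(test) > threshold
print(f"{flagged.sum()} of 500 test cases flagged as anomalous")
```

Cases that the low-dimensional subspace cannot reconstruct well are flagged, so the planted contamination should dominate the flags.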

There isn’t an R package for this – Insha explained that the code is only a few lines long and can be found at the end of the papers on arXiv!

Dennis Trewin Prize 2022

The Dennis Trewin Prize, named after the former Australian Statistician Dennis Trewin AO, is awarded annually by the Canberra Branch of the Statistical Society of Australia for outstanding research in statistics or data science by a current or recently graduated postgraduate student from an ACT or regional NSW (excluding Sydney-Newcastle-Wollongong) university. The 2022 competition took place on Tuesday 25 October, online and in person. Over 20 people attended across the two platforms.


This year I was honoured to be invited to be a judge of the three short-listed candidates: Dr Fui Swen Kuh of Monash University, Mr Zhi Yang Tho of the ANU and Mr Jiazhen Xu, also of the ANU. Below is a short description of each of their talks in order of presentation, based on the abstracts they provided. And the winner was … really hard to determine! In the end the judges awarded first prize to Zhi Yang, second prize to Swen and third prize to Jiazhen. Congratulations to all three contestants!

1. Fui Swen Kuh: A Holistic Bayesian Framework for Modelling Socio-Economic Health. We propose the novel LAtent Causal Socioeconomic Health (LACSH) index to holistically evaluate a country’s performance from the social, economic, political and environmental aspects to replace the narrowly focused gross domestic product (GDP). Our framework integrates the latent health factor index (LHFI) structure (a latent factor model), spatial modelling to formally account for spatial dependency among the nations and causal modelling to evaluate the impact of a continuous policy variable. We apply our methodology to investigate the causal effect of mandatory maternity leave days and government expenditure on healthcare on a country’s health. We believe this comprehensive approach is the first in the literature to capture a country’s holistic performance while accounting for spatial effects and examining the causal effect of public policy.
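Schematically, and purely as my own rough rendering of the abstract rather than the authors’ actual specification, an LHFI-style core ties observed indicators $y_{ij}$ for country $i$ to a latent health factor $\eta_i$, which in turn responds to the policy variable $z_i$ and a spatially correlated effect $u_i$:

$$y_{ij} = \lambda_j \eta_i + \varepsilon_{ij}, \qquad \eta_i = \tau z_i + u_i + e_i,$$

with all notation here hypothetical and the full causal and spatial machinery of the actual LACSH framework omitted.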


2. Zhi Yang Tho: Joint Mean and Correlation Regression Modelling for Multivariate Data. In the analysis of multivariate or multi-response data, researchers are often not only interested in studying how the mean (say) of each response evolves as a function of covariates, but also and simultaneously how the correlations between responses are related to one or more similarity/distance measures. To address such questions, we propose a novel joint mean and correlation regression model that simultaneously regresses the mean of each response against a set of covariates and the correlations between responses against a set of similarity measures, which can be applied to a wide variety of correlated discrete and (semi-)continuous responses. Under a general setting where the number of responses can tend to infinity with the number of clusters, we demonstrate that our proposed joint estimators of the regression coefficients and correlation parameters are consistent and asymptotically normally distributed with differing rates of convergence. We apply the proposed model to a dataset of overdispersed counts of 38 Carabidae ground beetle species sampled throughout Scotland, with results showing in particular that beetle total length and breeding season have statistically important effects in driving the correlations between beetle species.
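In schematic form (my notation, not necessarily the authors’), the model pairs a mean regression with a correlation regression:

$$g\big(\mathbb{E}[y_{ij}]\big) = \mathbf{x}_i^\top \boldsymbol{\beta}_j, \qquad \operatorname{corr}(y_{ij}, y_{ik}) = \mathbf{w}_{jk}^\top \boldsymbol{\alpha},$$

where $y_{ij}$ is response $j$ in cluster $i$, $\mathbf{w}_{jk}$ collects the similarity or distance measures between responses $j$ and $k$, and the right-hand side is constrained so that the implied correlation matrix stays valid.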


3. Jiazhen Xu: Generalized Score Matching for Regression. Many probabilistic models that have an intractable normalizing constant may be extended to contain covariates. Since the evaluation of the exact likelihood is difficult or even impossible for these models, we propose score matching to avoid explicit computation of the normalizing constant. In the literature, score matching has so far only been developed for models in which the observations are independent and identically distributed (IID). However, the IID assumption does not hold in the traditional fixed design setting for regression-type models. To deal with the estimation of these covariate-dependent models, we present a new score matching approach for independent but not necessarily identically distributed data under a general framework for both continuous and discrete responses, which includes a novel generalized score matching method for count response regression. We prove that our proposed score matching estimators are consistent and asymptotically normal under mild regularity conditions. The theoretical results are supported by simulation studies and a real-data example involving doctoral publication data.
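For context (a standard result rather than anything specific to the talk), the classical IID score matching objective of Hyvärinen (2005) avoids the normalizing constant because that constant does not depend on $x$ and so vanishes under $\partial/\partial x$; after integration by parts the objective can be written as

$$J(\theta) = \mathbb{E}_x\!\left[\sum_{i=1}^{d} \left( \partial_i^2 \log p_\theta(x) + \tfrac{1}{2}\big(\partial_i \log p_\theta(x)\big)^2 \right)\right],$$

which Jiazhen’s work extends to independent but non-identically distributed, covariate-dependent observations.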

What is a scientific instrument?

Dr Cheng Soon Ong is a principal research scientist at the Machine Learning Group in CSIRO Data61. Over 25 statisticians and data scientists came together on Zoom on Tuesday 11 October to hear his presentation to the Canberra Branch of the Statistical Society.

Cheng started with an ice-breaker of sorts, asking the audience what we thought the distribution of marathon finishing times would look like, before showing us a histogram of over 10 million results! From there it was off on a tour of fascinating examples where machine learning has been used successfully, from mapping algal blooms in Lake Burley Griffin to animating maps of the whole of Australia from space. Cheng’s main argument was that machine learning can be regarded as a kind of scientific instrument in its own right.

His final words included a plug for causality, including Judea Pearl’s book, now in its second edition. I will also be remembering Cheng’s two phases of scientific discovery (observation and experimentation) and the trade-off not between bias and variance but between exploration and exploitation.

Estimating global and country-specific excess mortality during the COVID-19 pandemic

Professor Jon Wakefield is a biostatistician at the University of Washington who presented to the School of Demography on Tuesday 11 October. Over 20 people joined the Zoom seminar, along with 10 in person for the watch party.


He started boldly with a plea for statistical rigour over complexity, and for the distinction between models and algorithms (which is one of my hobby horses too!). I was interested too that he came close to criticising statistics for spending so much energy on the pursuit of efficiency when the “legality” of a method is more important to secure first.


Jon also noted that machine learning may not be the “get out of jail free card” that data scientists may be looking for – even the bootstrap has conditions attached!


Jon then moved from the general to the specific, and spoke about the COVID excess-mortality modelling. Full national data is only available for 73 out of 194 countries, but those countries do represent 70% of the world’s population (hopelessly unevenly spread across continents, though). After an arduous process of modelling using a range of Poisson, Gamma and Negative Binomial models, the final results were released to a veritable media circus in early May 2022.
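To give a flavour of the expected-deaths half of such a calculation, here is a toy sketch: fit a count model to pre-pandemic years, predict forward, and subtract. Everything here is my own assumption for illustration (synthetic data, a simple linear trend, the statsmodels negative binomial family); it is emphatically not the WHO model itself:

```python
# Toy excess-mortality sketch: expected deaths from a negative binomial
# GLM fitted to pre-pandemic years, then excess = observed - expected.
# Entirely synthetic data -- not the WHO model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
years = np.arange(2015, 2022)
deaths = 50_000 + 500 * (years - 2015) + rng.poisson(300, size=years.size)
deaths[-2:] += 8_000                    # pandemic years 2020-21 run high

pre = years < 2020                      # fit only on pre-pandemic years
X = sm.add_constant(years - 2015)       # intercept + linear trend
fit = sm.GLM(deaths[pre], X[pre],
             family=sm.families.NegativeBinomial()).fit()

expected = fit.predict(X[~pre])         # expected deaths for 2020 and 2021
excess = deaths[~pre] - expected
print(dict(zip(years[~pre].tolist(), np.round(excess).tolist())))
```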


And the answer? The point estimate of global excess mortality over 2020-21 is 14.8 million with a 95% credible interval of (13.2, 16.6) million. The work will be published in Nature soon, with a methodological paper in the Annals of Applied Statistics.


I really enjoyed Jon’s presentation; he thinks deeply about what he is doing and isn’t afraid to say what he thinks about the state of statistical modelling and inference across the disciplines!