Variable selection and anomaly detection using PCA

Dr Insha Ullah is pursuing a post-doc in the Research School of Finance, Actuarial Studies and Statistics at ANU, and he gave the RSFAS seminar on Thursday 27 October. About a dozen people attended in person, with another dozen online.

Insha introduced two related pieces of work that showed the use of principal components analysis (PCA). The first was on variable selection, particularly in situations where the number of variables, p, far exceeds the number of cases, n. It turns out LASSO regression is not appropriate when n is small, so another approach is needed. Smyth’s moderated t statistics, introduced in his 2004 paper, have received over 13,000 citations to date and have clearly proven extremely useful in a variety of situations, but they were also not the route Insha wished to pursue.

Insha was able to show through a simulation study that the PCA variable selection method he outlined did very well for 300 variables and between 10 and 100 cases. He also showed a real-world example with 22,800 genes as the variables and 190 cases.
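To give a feel for the p far greater than n setting, here is a minimal Python sketch of one generic way PCA can be used to rank variables. It is not the method from the talk, whose details are not covered in this write-up. The simulated data, the 10 informative variables, the signal strength of 3.0 and the rule "keep the variables with the largest absolute loadings on the first principal component" are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: 50 cases, 300 variables, of which the first 10 share a latent signal.
n, p, n_informative = 50, 300, 10

X = rng.normal(size=(n, p))            # pure-noise variables
latent = rng.normal(size=(n, 1))       # hypothetical latent factor
X[:, :n_informative] += 3.0 * latent   # informative variables = noise + shared signal

# PCA via SVD of the column-centred data; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rank variables by the size of their loading on the first principal component.
scores = np.abs(Vt[0])
selected = np.argsort(scores)[::-1][:n_informative]

print(sorted(selected.tolist()))
```

Because the shared signal is much stronger than the noise, the first principal component should line up with the informative variables, so the selected indices should be the ones the signal was planted in. Real data will rarely be this clean.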

In the second case study, Insha showed how to use PCA for anomaly detection in computer networks. This time the supporting simulation study involved a cool 10,000 cases to train the algorithm and 500 to test (with 100 contaminated cases planted in the test data). The real-world example was from a publicly available anomaly detection dataset called KDD’99, involving 5 million cases of which around a quarter were anomalous.
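The shape of that simulation (10,000 training cases, 500 test cases, 100 of them contaminated) can be mimicked in a few lines of Python. The sketch below uses a common PCA approach to anomaly detection: fit the principal components on clean training data, then flag test cases whose reconstruction error is unusually large. This is an assumption about how such a detector might look, not Insha's algorithm. The 20 variables, the 3 retained components, the noise levels and the 99% threshold are all made up for illustration.

```python
import numpy as np

def fit_pca(X, k):
    """Return the column means and the first k principal directions of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(X, mean, components):
    """Squared distance between each case and its projection onto the PCA subspace."""
    Xc = X - mean
    projected = Xc @ components.T @ components
    return np.sum((Xc - projected) ** 2, axis=1)

rng = np.random.default_rng(0)
p, k = 20, 3                          # assumed: 20 variables driven by 3 latent factors
loadings = rng.normal(size=(k, p))

def simulate(n):
    """Clean cases: low-dimensional signal plus a little noise."""
    return rng.normal(size=(n, k)) @ loadings + 0.1 * rng.normal(size=(n, p))

train = simulate(10_000)
test = simulate(500)

# Plant 100 contaminated cases in the test data by adding large extra noise.
is_anomaly = np.zeros(500, dtype=bool)
is_anomaly[:100] = True
test[is_anomaly] += rng.normal(scale=2.0, size=(100, p))

mean, comps = fit_pca(train, k)

# Threshold at the 99th percentile of the training errors (an arbitrary choice).
threshold = np.quantile(reconstruction_error(train, mean, comps), 0.99)
flagged = reconstruction_error(test, mean, comps) > threshold

recall = (flagged & is_anomaly).sum() / is_anomaly.sum()
false_alarm_rate = (flagged & ~is_anomaly).sum() / (~is_anomaly).sum()
print(f"recall={recall:.2f}, false alarm rate={false_alarm_rate:.3f}")
```

The contamination here is deliberately easy to spot. Real network traffic such as KDD’99 mixes categorical and continuous variables and would need preprocessing before anything like this could be applied.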

There isn’t an R package for this – Insha explained that the code is not very many lines long and can be found at the end of the papers on arXiv!
