Research software engineers: how will they shape statistics?

What a fascinating topic for the Canberra and Victoria Branches of the SSA to tackle, along with the SSA Statistical Computing and Visualisation Section and the Australian Research Data Commons! Nearly 70 people Zoomed in on Tuesday 26 April to hear the panellists.

The event was discussion-driven, with no talks, and the four presenters each brought an interesting angle to the conversation. There were Maelle Salmon from rOpenSci, Paula Andrea Martinez from ARDC, Thomas Lumley from the University of Auckland and Nick Golding of Telethon Kids Institute / Curtin University, with Kim-Anh Le Cao in the chair.

A growing number of people in academia combine their (perhaps informally gained) programming skills with their research expertise. This led to the establishment of the UK Research Software Engineering Association in 2013, which coined the term “Research Software Engineer” (RSE) to represent this community. RSEs do not need formal training in software engineering and are often embedded across different disciplines, perhaps without being named as RSEs.

Research software is a key pillar underpinning the modern landscape of statistics and data analysis, yet those who develop these tools often lack formal recognition or reward in the academic system. The panel tackled questions such as: does the academic system need to be modified to retain these talents? If so, how should the research and education system be modified to accommodate the growth of RSEs? And how will RSEs contribute to shaping the field of statistics?

I had one useful takeaway: this reference from Paula on how to make research software citable. The discussion and links to various resources were among the highlights of the presentation.

Statistical methods for small area estimation of cancer risk factors and their application to measure area-level cancer risk across Australia

Wednesday 20 April: it was great to join over 20 people online and an unknown number in person at the Queensland University of Technology to hear Jamie Hogg present his PhD confirmation seminar.

His project consists of three main pieces of work. First is to estimate the prevalence of risk factors ranging from smoking and obesity to alcohol, diet and physical activity, at the SA2 level across Australia. Sounds simple, right? The data restrictions and Bayesian small area estimation complexities mean this has been no small task thus far!

Second is to compare the spatial variation in these risk factors, and look at ecological associations. Third will be to combine all the risk factors together to come up with area-level cancer risk indices.

Jamie has been making great inroads into the literature on this topic, and some of the concepts and papers that caught my eye were these three. Gotta love a pair of methods with slightly silly names!

Kennedy & Gelman (2021). “Know your population and know your model: Using model-based regression and poststratification to generalize findings beyond the observed sample.”

MrP: Gelman & Little (1997). “Poststratification Into Many Categories Using Hierarchical Logistic Regression.”

MrsP: Leemann & Wasserfallen (2017). “Extending the Use and Prediction Precision of Subnational Public Opinion Estimation.”
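
For readers new to the "P" in MrP, here is a minimal sketch of the poststratification step on its own, in Python. It is not Jamie's method and it leaves out the multilevel regression entirely: the cell-level estimates are simply taken as given. The function name, the age groups and every number below are made up for illustration.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of cell-level estimates for one area.

    cell_estimates: estimated prevalence per demographic cell (hypothetical,
                    in MrP these would come from a multilevel model)
    cell_counts:    population count per cell in the area (e.g. from a census)
    """
    total = sum(cell_counts.values())
    if total == 0:
        raise ValueError("area has no population")
    weighted = sum(cell_estimates[cell] * count
                   for cell, count in cell_counts.items())
    return weighted / total


# Made-up smoking prevalence by age group for a single small area
estimates = {"18-39": 0.20, "40-64": 0.15, "65+": 0.08}
counts = {"18-39": 500, "40-64": 300, "65+": 200}

print(round(poststratify(estimates, counts), 3))
```

The area-level figure is pulled towards the cells that make up most of the local population, which is why good population counts matter as much as a good model.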

Statistical analysis of machine learning methods

It might have been 4pm on the Thursday before Easter (14 April) but this talk attracted close to 30 attendees on Zoom – testament to the drawing power of the speaker and title!

Professor Johannes Schmidt-Hieber of the University of Twente in the Netherlands presented a topic of interest to academics and Government statisticians from several states. I’ve been wondering about this topic for a long time too.

Johannes began by noting the fundamental difference between machine learning and statistics. In machine learning, there’s an objective function to be minimised on the training data over some parameters (as always let’s call them theta). In statistics, there’s one more key ingredient – the distribution of the data.

Johannes then focused in on two machine learning methods. First was neural networks and deep learning. He showed that they could be written in the form of a statistical model, much like a nonparametric regression, along with a two-part formula for the prediction error. Secondly, working in the opposite direction, he described how multivariate adaptive regression splines could also be represented by a sparse neural network.

These innovations raised the question then – is it going to be possible to defy the bias-variance trade-off that seems to be unavoidable in statistics? Johannes answered No, while pointing to some interesting literature and debate around this topic from the late 2010s.
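
As a reminder of what the trade-off looks like in the simplest possible setting, here is a small simulation sketch in Python. It is my own toy example, not something from the talk: a sample mean is shrunk towards zero by a factor c, and the mean squared error is split into squared bias plus variance. The function name and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_shrinkage(c, mu=2.0, sigma=1.0, n=10, reps=2000, seed=1):
    """Simulate the estimator c * sample_mean for a Normal(mu, sigma) sample.

    Returns (squared bias, variance, mean squared error), all computed
    empirically over the simulated replicates.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimates.append(c * sum(sample) / n)

    mean_est = sum(estimates) / reps
    bias_sq = (mean_est - mu) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / reps
    mse = sum((e - mu) ** 2 for e in estimates) / reps
    return bias_sq, variance, mse


for c in (1.0, 0.8, 0.5):
    bias_sq, variance, mse = simulate_shrinkage(c)
    print(f"c={c}: bias^2={bias_sq:.4f} variance={variance:.4f} mse={mse:.4f}")
```

Shrinking harder (smaller c) lowers the variance but raises the squared bias, and the two always add up to the mean squared error.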

In conclusion, a thought-provoking way to lead into the Easter holidays!

From Me to team: learning to code collaboratively

I’m still really enjoying the access to RLadies events organised by multiple chapters through Zoom. On Wednesday April 13 it was RLadies Auckland who hosted this event, presented by Victoria King, a Psychology PhD candidate from the University of Auckland.

She opened with some background about the nature of PhD research. Her gist was pretty much the same as I was told many many years ago: it’s like pushing against the boundaries of knowledge, as much as pushing them back! She also introduced her own PhD topic and showed some fascinating “first attempt” visualisations of her data, followed up by “the current iteration”, much improved. I’m also a great fan of building graphics bottom-up, starting with the basic structure and enhancing it.

Everyone’s favourite analogy from Victoria’s talk was that coding collaboratively is like working in a shared workshop garage. It’s just plain polite to leave it as tidy as you can for the other users. Victoria’s four principles of collaborative coding expanded on this notion, encompassing the qualities of consistency, clarity, kindness and empathy.

Everyone agreed with Victoria that messy code shared is better than no code shared, and the notion of the Community Research and Academic Programming licence (check out those initials!) was a good idea. Victoria also pointed out quite rightly that while data may be confidential and not able to be shared, the code is very often not going to breach those principles and can be shared.

Her final thoughts were around collaboration on other sorts of output, not just code, such as documentation or a wiki. And this link for pithy reflections.

Data viz for mass audiences

RLadies Melbourne ran a meetup online on Thursday 7 April, with over 25 attendees. The presenter was Juliette O’Brien, a digital and data journalist and creator of covid19data.com.au, who currently teaches Data and Computational Journalism at the University of Technology Sydney.

This was a great talk about her experiences in setting up and running the COVID19 website above. Underneath the exciting diagrams there’s a range of software, from the familiar such as Google Docs, Twitter and Mailchimp to the less familiar (to me) such as Infogram, Datawrapper and Flourish.

Juliette spoke of the challenges in maintaining the site, such as changing data, patch release patterns and variation between jurisdictions. She came up with several learnings from the experience, a couple of which really spoke to me. The notion of distinguishing between tracking, projections and modelling relates very much to my campaign to distinguish between models, methods and algorithms. The notion that the harder option is often the better option also struck me as being applicable to a number of statistical projects that I have been involved in.

Expectations naturally rose as the site matured. Juliette spoke about the desires for timeliness, speed, transparency, history, interactivity and story-telling that started to weigh on the production of the site.

As always, it’s great to take away some new ideas from a presentation such as this. I learnt about sparklines, those tiny scraps of time series which can hold a surprising amount of comparative detail. I was also reminded of the age-old notion that grunt work can actually be your competitive difference.
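
If you haven't met sparklines before, here is a minimal sketch of the idea in Python, using the eight Unicode block characters as bars. This is my own illustration, not the tooling behind Juliette's site, and the case counts are made up.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline(values):
    """Map a numeric series onto block characters, scaled to its own range."""
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no range to scale by, so draw it at the baseline
        return BARS[0] * len(values)
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)


daily_cases = [3, 5, 9, 14, 12, 8, 4, 2]  # hypothetical counts
print(sparkline(daily_cases))
```

Each series is scaled to its own minimum and maximum, so sparklines placed side by side compare shape rather than absolute level.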