# La Ville R(ose)

I just came back from the useR! conference in Toulouse, which I enjoyed attending and can recommend.

Some observations:

• The conference was a good mix of lectures, coding workshops and short presentations of new packages and papers.
• The tidyverse has captured this community. Most code examples use the pipe and dplyr’s functions like select() or mutate() without explanation.
• The vibe is closer to what I’m used to from academic conferences, and it was less crowded and exuberant than NeurIPS.
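For readers who haven’t seen this style: the pipe chains data operations left to right. A minimal sketch using the built-in mtcars data (the conversion factor is approximate and just for illustration):

```r
library(dplyr)

# The pipe passes the result on its left into the next function
mtcars %>%
  select(mpg, cyl) %>%            # keep only two columns
  mutate(kpl = mpg * 0.425) %>%   # add a derived column (miles/gallon to km/litre, approx.)
  head(3)
```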

I mainly attended talks on time series statistics and movement data, plus a few others.

## Time series statistics

The tidyverts collection of packages for tidy time series analysis is fantastic and should be up on CRAN soon. It contains:

• tsibble: An evolution of the tibble (itself an update to the data frame) for time series purposes. A tsibble makes it explicit which variable indexes time (e.g. “year”) and which variable groups the rows (e.g. “country”), and it stores the frequency at which the data runs.
• feasts: New plotting tools tailored to specific frequencies (e.g. seasons or weekdays) and decompositions of series into trend and cyclical components using different methods (STL, X11, …).
• fable: Time series forecasting with many common methods and new visualization functionalities.
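To illustrate the tsibble idea, a minimal sketch with made-up yearly data (the numbers are hypothetical):

```r
library(tsibble)

# Hypothetical yearly panel: one row per country and year
df <- data.frame(
  country = rep(c("FR", "DE"), each = 3),
  year    = rep(2016:2018, times = 2),
  gdp     = c(2.47, 2.58, 2.78, 3.47, 3.69, 3.95)
)

# index marks the time variable, key the grouping variable
ts <- as_tsibble(df, index = year, key = country)
ts  # prints the interval along with the data
```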

Other time series contributions:

• A presentation on using random forests with time series data (paper, presentation). There was a lively discussion on how to create the block bootstrap samples needed for the forest (overlapping or not? moving windows? what block size?). The solution from the talk was to use a validation sample on which to test for the optimal block size.
• For economists, timeseriesdb - developed at KOF Zurich - might be of interest. It provides a good way to store different vintages of time series and their metadata in a database.
• The imputeTS package was presented with an application on sensor data. I could well imagine using that package in a manufacturing analytics study, such as predictive maintenance. In these cases, you often have high-frequency measurements (e.g. by minute) for thousands of sensors, but the quality of measurements is hard to judge, outliers are common, and sometimes missing data is even implicit (such as a sensor returning the last value again and again). The presenter pointed out that in these cases, the missingness is often correlated across variables (e.g. when there’s a factory shutdown, bad weather stopping data transmission, etc).
• Anomaly detection got good coverage: The package anomaly (presentation), the stray package and a trivago real-world example with an interesting conclusion (presentation).
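As a taste of imputeTS, a minimal sketch on a toy sensor series; na_interpolation() is one of several imputation functions the package offers:

```r
library(imputeTS)

# Toy sensor series with missing readings, e.g. from dropped transmissions
x <- c(5.1, 5.3, NA, NA, 5.9, 6.0, NA, 6.4)

# Fill the gaps by linear interpolation; na_kalman() and na_ma() are alternatives
x_filled <- na_interpolation(x, option = "linear")
x_filled  # no NAs remain
```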

## Spatial methods

Movement data is currently my favourite kind of data. It spreads across time and space, every data point is “weighty”, describing the repositioning of a giant machine or a group of people, and you can use interesting methods to analyze it.

I was therefore happy to see that there is a lot of work going on in creating packages for drawing maps and analyzing movement data. Presentations: 1, 2, 3, 4. The sf package was presented in two workshops (1, 2).

However, I’ve often found this topic quite difficult to start out with in R and I don’t think it’s become much easier yet. I’m still not convinced that I would go this route if I just needed to draw a quick map. A tool like Tableau takes care of all the underlying stuff such as guessing correctly that some column in your data describes US zip codes and draws the right map based on that.

## Other: Packages building, data cleaning, big files

Jenny Bryan gave a good tutorial on package development and made her point well that we should be writing packages much more often.

Hadley Wickham explained how the great tidyr package is getting a facelift, renaming spread() to the more expressive pivot_wider() and gather() to pivot_longer().
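In the new vocabulary, the round trip looks like this (a small sketch with made-up data; requires a tidyr version that already includes the pivot functions):

```r
library(tidyr)

# Toy long-format data: one row per country and year
long <- data.frame(
  country = c("FR", "FR", "DE", "DE"),
  year    = c(2017, 2018, 2017, 2018),
  gdp     = c(2.58, 2.78, 3.69, 3.95)
)

# Formerly spread(long, year, gdp): one column per year
wide <- pivot_wider(long, names_from = year, values_from = gdp)

# Formerly gather(): back to long format
long2 <- pivot_longer(wide, cols = -country,
                      names_to = "year", values_to = "gdp")
```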

I was quite impressed by the disk.frame package. It splits a too-large-for-memory dataset into smaller chunks on the local machine and only pulls in the columns you need. It also allows for quick staggered aggregations, such as calculating the sum of a variable within each chunk and then taking the sum of those sums. Interestingly, that wouldn’t work for other functions such as the median.
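The staggered-aggregation point can be illustrated without the package itself; this is a toy sketch in base R, not disk.frame’s actual API:

```r
# Pretend these three chunks live on disk as separate files
x <- c(1, 2, 9, 3, 4, 5, 6, 7, 8)
chunks <- split(x, rep(1:3, each = 3))

# Sum is safe to stagger: the sum of chunk sums equals the total sum
sum(sapply(chunks, sum)) == sum(x)  # TRUE

# Median is not: the median of chunk medians...
median(sapply(chunks, median))      # 4
# ...differs from the true median
median(x)                           # 5
```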