The platform relies on moving and transforming data. We achieve this using reproducible analytical pipelines (RAPs). You can read more about the general concepts in this blog post about RAPs.
We break our pipelines into three broad stages: Extract, Transform and Prepare.
This echoes the well-understood ETL (Extract, Transform, Load) pattern of data processing. The third step is renamed Prepare in recognition that we often are not actually loading the data into a data store; in principle, this stage could prepare the data for loading if needed.
You can read more about each stage below.
The aim of the extract stage is to get a copy of the data from the source - either a published open data source or potentially an operational system. The nature of the source is largely irrelevant; the purpose of this stage is to handle whatever interfacing is required. This could be:
In principle, we can work with any source of data that can be accessed programmatically.
While many of the sources of data are already open, some of them may not be. If we're accessing data from an operational system, we may also have access to sensitive personal information. Consequently, this data will typically be downloaded to a working area that is excluded from git repositories, so that we do not inadvertently share raw data.
The extract stage should not alter the data, but rather store it in the working directory in a format that is as close as possible to that of the source system. This makes it easier to debug data quality issues later in the process.
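As a minimal sketch of what an extract script might look like (the URL, file names and working directory below are hypothetical, not part of the platform):

```python
"""Extract stage: fetch a copy of the source data without altering it."""
from pathlib import Path

import requests  # assumed to be available in the pipeline environment

# Hypothetical open data source and git-ignored working area.
SOURCE_URL = "https://example.org/open-data/monthly-figures.csv"
WORKING_DIR = Path("working")  # excluded from the git repository


def extract() -> Path:
    """Download the source file and store it unchanged in the working area."""
    WORKING_DIR.mkdir(exist_ok=True)
    destination = WORKING_DIR / "monthly-figures.csv"

    response = requests.get(SOURCE_URL, timeout=60)
    response.raise_for_status()

    # Write the payload byte-for-byte so the raw copy stays as close as
    # possible to the source system.
    destination.write_bytes(response.content)
    return destination


if __name__ == "__main__":
    extract()
```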
The transform stage is responsible for validating, sanitising and transforming the data produced by the extract stage. This may take several forms:
A quick note about granularity: we prefer to store data at as close to row level as we can. We can always summarise from row-level data, but that operation cannot be reversed. Where necessary to prevent leaks of personal information, we may choose to summarise by an appropriate dimension.
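As an illustrative sketch (the column names, schema checks and file paths are hypothetical), a transform step might validate and sanitise the raw extract while keeping row-level granularity:

```python
"""Transform stage: validate, sanitise and reshape the extracted data."""
from pathlib import Path

import pandas as pd  # assumed to be available in the pipeline environment

RAW_FILE = Path("working") / "monthly-figures.csv"       # from the extract stage
TRANSFORMED_FILE = Path("data") / "monthly-figures.csv"  # checked in to the repository

EXPECTED_COLUMNS = {"date", "area_code", "value"}  # hypothetical schema


def transform() -> pd.DataFrame:
    """Validate the raw extract and write a clean, row-level copy."""
    df = pd.read_csv(RAW_FILE)

    # Validate: fail loudly if the source schema has changed.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Source data is missing expected columns: {missing}")

    # Sanitise: parse types and drop rows that cannot be used downstream.
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    df = df.dropna(subset=["date", "value"])

    # Keep row-level granularity; later stages can summarise as needed.
    TRANSFORMED_FILE.parent.mkdir(exist_ok=True)
    df.to_csv(TRANSFORMED_FILE, index=False)
    return df


if __name__ == "__main__":
    transform()
```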
The data stored here is checked in to the appropriate repository. Ideally we would also capture the following data:
The prepare stage is responsible for converting the data into a form that matches the needs of the downstream process. That might include:
There are far fewer constraints on this stage, as it is difficult to define processing that is common to every downstream use.
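As one sketch of what this stage could look like (again with hypothetical file and column names), a prepare step might summarise the transformed data into the shape a downstream report or dashboard expects:

```python
"""Prepare stage: shape the transformed data for a downstream consumer."""
from pathlib import Path

import pandas as pd

TRANSFORMED_FILE = Path("data") / "monthly-figures.csv"  # from the transform stage
PREPARED_FILE = Path("data") / "monthly-totals-by-area.csv"


def prepare() -> pd.DataFrame:
    """Summarise row-level data to the granularity the downstream process needs."""
    df = pd.read_csv(TRANSFORMED_FILE, parse_dates=["date"])

    # Example: monthly totals per area for a hypothetical reporting dashboard.
    prepared = (
        df.assign(month=df["date"].dt.to_period("M").astype(str))
          .groupby(["month", "area_code"], as_index=False)["value"]
          .sum()
    )

    prepared.to_csv(PREPARED_FILE, index=False)
    return prepared


if __name__ == "__main__":
    prepare()
```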
We've found that defining each stage of a given pipeline as a separate script, then orchestrating those scripts with a tool like DVC, is a far more maintainable approach.