Unit 3: Advanced Recoding with SDMX

Recoding of data can be very complex and SDMX and FMR respond to this complexity by supporting the use of regular expressions (regex) to specify recoding rules. This unit explains how.

Regular expressions

Previously we said that representation maps can include complex rules, such as regular expressions (regex) on source values and can even define periods of time for which a mapping relationship is true. For example,

if a relationship between the source country and target currency is defined, then
one could map France to the French Franc up until 2002,
and then map France to the Euro from 2002 onwards.

If values do not require mapping, for example if the source FREQ maps to target FREQ and the values are the same in both the source and target DSD, then the component map should not link to a representation map. The lack of a link will inform the system that the value should be copied across verbatim.

Order of rules

The order of rules in a representation map can be important, specifically when using regular expressions. The regular expressions are tested in the same order that they appear, this allows for more specific expressions to be tested before a general catch all.

Example

Rule 1 (no regex): A -> B (A maps to B)
Rule 2 (reg ex): A\dB -> B2 (A followed by a number followed by B maps to B2)
Rule 3 (reg ex): .* -> _Z (anything maps to _Z)

Source data will first be checked against the exact match rule (A maps to B), followed by each regular expression rule until a match is found. As the last expression matched on anything, this can be considered as ‘if nothing matches then output _Z’.

Regular expression: Example use cases

A rule can be defined to match a specific pattern, which is then used in the output. An additional use case is where there is more than one source component for a mapping.

Select each example use case for the details.

Use case example 1

A rule can be defined to match a specific pattern, which is then used in the output.

For example, the rule can state any three characters followed by a number is converted to the same three characters without the number. This can be satisfied by using regular expressions to match the input, with a capture group. A capture group is where the regular expression rule is in parentheses which can then be referred to by number (capture group 1, 2, 3, and so on).

Example

RegEx Input: ([A-Z]{3})_([0-9])
Output Expression: \2_\1

This example consists of two capture groups:

([A-Z]{3}) Any A to Z character 3 times
(0-9) the number zero to nine

The output expression then reverses the order of the information by outputting capture group 2, an underscore, followed by capture group 1.

An example input for the above expression, and corresponding output is as follows:

Input = ABC_1
Output = 1_ABC

Use case example 2

An additional use case is where there is more than one source component for a mapping e.g. CURRENCY and REF_AREA. The value for one of the source components is based on a pattern and the value for the second component is based on what matched the first pattern.

Like the pattern match on output, this rule makes use of regular expression capture groups to copy matched information from one rule to another. The capture group (everything matched in the parenthesis) is referred to by number, with a leading slash .

Example

CURRENCY = (.*)
REF_AREA = \1_X

In this example the first input matches on anything, but the REF_AREA rule is using the matched value from CURRENCY, defined by capture group \1, followed by _X. The following shows a match and a miss:

Match: CURRENCY=USD, REF_AREA=USD_X
Miss: CURRENCY_USD, REF_AREA=US_X

Reshaping a dataset – from wide to tidy

To reshape a dataset from wide to tidy, we take source data with multiple measures e.g. BIRTHS, DEATHS, MARRIAGES and convert it to a DSD with only one OBS_VALUE.

In the following example, the measure is converted into a dimension value i.e. INDICATOR.

Select the options below for the details.

Source data

The table below shows an example source dataset with three measures, BIRTHS, DEATHS, MARRIAGES.

REF_AREA	TIME	BIRTHS	DEATHS	MARRIAGES
UK	2020	11	12	13
FR	2020	21	22	23

The dataset should be mapped to convert the BIRTHS, DEATHS and MARRIAGES to the INDICATOR B, D, and M respectively. The observation value is the value of each corresponding measure.

Desired output

Desired output – Tidy data, one observation per row

REF_AREA	INDICATOR	TIME_PERIOD	OBS_VALUE
UK	B	2020	11
UK	D	2020	12
UK	M	2020	13
FR	B	2020	21
FR	D	2020	22
FR	M	2020	23

This mapping relationship can be defined by mapping each source MEASURE to both the OBS_VALUE component and the INDICATOR component.

Example

The rule for the INDICATOR mapping should be a single 'catch all' regular expression, which maps the particular measure to a fixed value.

Example
Component Map 1: BIRTHS maps to OBS_VALUE
Component Map 2: BIRTHS maps to INDICATOR (uses Births Representation Map)
Births Representation Map: source=[anything], target=B
Where the [anything] rule is simply the regular expression .*

Component Map 3: DEATHS maps to OBS_VALUE
Component Map 4: DEATHS maps to INDICATOR (uses Deaths Representation Map)
Deaths Representation Map: source=.*, target=D

Component Map 5: MARRIAGES maps to OBS_VALUE
Component Map 6: MARRIAGES maps to INDICATOR (uses Marriages Representation Map)
Marriages Representation Map: source=.*, target=M

Coming next …

Recoding capabilities in SDMX are extremely powerful and the possibilities endless. The best way to become familiar with SDMX capabilities for recoding and transforming data is to experiment with both simple and more complex transformation scenarios.

In doing so, it is important to be able to test and validate the transformations against different test cases (test datasets). The next unit provides more information on testing and automating SDMX data transformations.