YAML in data analysis

In the circles I work in, it is common to write machine learning and data analysis settings in YAML (mainly through Kedro). Anchors (&) are used for shared settings to keep the configuration as DRY (don't repeat yourself) as possible, but the YAML specification gets in the way: mappings can be merged, but arrays cannot. The YAML team does not seem willing to support this in the specification (it has been raised at https://github.com/yaml/yaml/issues/35, and you can see that similar issues are repeatedly opened and closed).
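
To make the "mappings can be merged" half of that statement concrete, here is a minimal sketch of the merge key (`<<`) with PyYAML. The keys `base` and `derived` and the value `churn` are made up for illustration and do not come from any real project.

```python
import yaml

# Hypothetical config: `derived` reuses every key of `base` and overrides one.
doc = """
base: &base
  model_to_use: xgboost
  target: propensity
derived:
  <<: *base
  target: churn
"""

config = yaml.safe_load(doc)
print(config["derived"])
```

The merged mapping comes out as a single flat dict, with the locally defined `target` taking precedence. There is no equivalent key for sequences, which is the gap the rest of this post works around.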

A specific example of trouble with YAML

Specifically, I run into trouble in situations like the following.

common_features: &common
  - member_reward_program_status
  - member_is_subscribing

transaction_features: &transaction
  - num_transactions
  - average_transaction_amount
  - time_since_last_transaction

next_product_to_buy:
  model_to_use: xgboost
  feature_whitelist:
    - *common
    - *transaction
    - last_product_bought
    - applied_to_campaign
  target: propensity

Imagine you have several groups of features and you want to combine them to build a model. What I want feature_whitelist to contain is:

[
  'member_reward_program_status', 
  'member_is_subscribing', 
  'num_transactions', 
  'average_transaction_amount', 
  'time_since_last_transaction', 
  'last_product_bought', 
  'applied_to_campaign'
]

However, with the above settings, you will end up with a nested list like the one below.

[
  [
    'member_reward_program_status', 
    'member_is_subscribing', 
  ],
  [
    'num_transactions', 
    'average_transaction_amount', 
    'time_since_last_transaction', 
  ],
  'last_product_bought', 
  'applied_to_campaign'
]
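
The nesting is easy to reproduce. The following sketch simply loads the configuration above with PyYAML's `safe_load` and prints the resulting list; nothing here is specific to Kedro.

```python
import yaml

doc = """
common_features: &common
  - member_reward_program_status
  - member_is_subscribing

transaction_features: &transaction
  - num_transactions
  - average_transaction_amount
  - time_since_last_transaction

next_product_to_buy:
  model_to_use: xgboost
  feature_whitelist:
    - *common
    - *transaction
    - last_product_bought
    - applied_to_campaign
  target: propensity
"""

config = yaml.safe_load(doc)
whitelist = config["next_product_to_buy"]["feature_whitelist"]

# Each alias is inserted as one element, so the first two items are lists.
print(whitelist)
print([type(item).__name__ for item in whitelist])
```

An alias always stands for the whole anchored node, so `- *common` adds one element that is itself a list rather than splicing its items into the parent.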

Define a YAML tag

The background this time is that I wanted to use this to extend Kedro. Kedro uses anyconfig to load TemplatedConfig, and anyconfig itself appears to support both PyYAML and ruamel.yaml, but Kedro lists PyYAML as a requirement. So let's work out how to do it with PyYAML.

The official docs also explain how to implement your own tags, so I referred to them and defined a constructor for the tag.

from typing import Iterator, List

import yaml


def construct_flat_list(loader: yaml.Loader, node: yaml.Node) -> List[str]:
    """Make a flat list, should be used with '!flatten'

    Args:
        loader: Unused, but necessary to pass to `yaml.add_constructor`
        node: The passed node to flatten
    """
    return list(flatten_sequence(node))


def flatten_sequence(sequence: yaml.Node) -> Iterator[str]:
    """Flatten a nested sequence to a list of strings
        A nested structure is always a SequenceNode
    """
    # A bare scalar is already flat
    if isinstance(sequence, yaml.ScalarNode):
        yield sequence.value
        return
    if not isinstance(sequence, yaml.SequenceNode):
        raise TypeError(f"'!flatten' can only flatten sequence nodes, not {sequence}")
    for el in sequence.value:
        if isinstance(el, yaml.SequenceNode):
            # Recurse into nested sequences (for example, resolved aliases)
            yield from flatten_sequence(el)
        elif isinstance(el, yaml.ScalarNode):
            yield el.value
        else:
            raise TypeError(f"'!flatten' can only take scalar nodes, not {el}")


# Register the tag only after the constructor has been defined
yaml.add_constructor("!flatten", construct_flat_list)
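
Applied to the feature_whitelist example from earlier, the tag could be used as in the sketch below. To keep the snippet self-contained it repeats a condensed variant of the constructor under different names (`iter_scalars` and `flatten_constructor` are my own labels for this sketch), and it registers the tag on `yaml.SafeLoader` so that `yaml.safe_load` can be used. Which loader Kedro or anyconfig ends up using is not something I have checked here, so treat that choice as an assumption.

```python
from typing import Iterator, List

import yaml


def iter_scalars(node: yaml.Node) -> Iterator[str]:
    """Yield scalar values from arbitrarily nested sequence nodes."""
    if isinstance(node, yaml.ScalarNode):
        yield node.value
    elif isinstance(node, yaml.SequenceNode):
        for child in node.value:
            yield from iter_scalars(child)
    else:
        raise TypeError(f"'!flatten' cannot handle a node tagged {node.tag}")


def flatten_constructor(loader: yaml.SafeLoader, node: yaml.Node) -> List[str]:
    """Constructor for '!flatten'; the loader argument is unused."""
    return list(iter_scalars(node))


# Assumption: registering on SafeLoader so that yaml.safe_load understands the tag.
yaml.add_constructor("!flatten", flatten_constructor, Loader=yaml.SafeLoader)

doc = """
common_features: &common
  - member_reward_program_status
  - member_is_subscribing

transaction_features: &transaction
  - num_transactions
  - average_transaction_amount
  - time_since_last_transaction

next_product_to_buy:
  model_to_use: xgboost
  feature_whitelist: !flatten
    - *common
    - *transaction
    - last_product_bought
    - applied_to_campaign
  target: propensity
"""

config = yaml.safe_load(doc)
features = config["next_product_to_buy"]["feature_whitelist"]
print(features)
```

The only change to the configuration itself is the `!flatten` tag on `feature_whitelist`; the anchored lists stay usable on their own.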

Before creating Python objects, PyYAML first parses the YAML into a document made of PyYAML node objects. In that document, every array is stored as a yaml.SequenceNode and every value as a yaml.ScalarNode, so the constructor above can recursively pull out just the values. The test code that checks the behavior is below. Tagging a nested array with !flatten converts it into a flat array.

import pytest
def test_flatten_yaml():
    # single nest
    param_string = """
    bread: &bread
      - toast
      - loafs
    chicken: &chicken
      - *bread
    midnight_meal: !flatten
      - *chicken
      - *bread
    """
    params = yaml.load(param_string, Loader=yaml.Loader)
    assert sorted(params["midnight_meal"]) == sorted(
        ["toast", "loafs", "toast", "loafs"]
    )

    # double nested
    param_string = """
    bread: &bread
      - toast
      - loafs
    chicken: &chicken
      - *bread
    dinner: &dinner
      - *chicken
      - *bread
    midnight_meal_long:
      - *chicken
      - *bread
      - *dinner
    midnight_meal: !flatten
      - *chicken
      - *bread
      - *dinner
    """
    params = yaml.load(param_string, Loader=yaml.Loader)
    assert sorted(params["midnight_meal"]) == sorted(
        ["toast", "loafs", "toast", "loafs", "toast", "loafs", "toast", "loafs"]
    )

    # doesn't work with mappings
    param_string = """
    bread: &bread
      - toast
      - loafs
    chicken: &chicken
      meat: breast
    midnight_meal: !flatten
      - *chicken
      - *bread
    """
    with pytest.raises(TypeError):
        yaml.load(param_string, Loader=yaml.Loader)
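
If you want to look at the node document that the constructor receives, `yaml.compose` returns it without building Python objects. The sketch below is only an inspection aid; the `meal` key and its contents are made up for illustration.

```python
import yaml

doc = """
bread: &bread
  - toast
  - loafs
meal:
  - *bread
  - soup
  - 3
"""

# compose() stops at the node graph instead of constructing Python objects.
root = yaml.compose(doc, Loader=yaml.SafeLoader)

# A MappingNode stores its entries as (key_node, value_node) pairs.
nodes = {key.value: value for key, value in root.value}
meal = nodes["meal"]

print(type(root).__name__)
print([type(child).__name__ for child in meal.value])

# An alias resolves to the very same node object as its anchor.
print(meal.value[0] is nodes["bread"])

# ScalarNode.value is always the raw text; the type lives in the tag.
print(repr(meal.value[2].value), meal.value[2].tag)
```

One consequence of the last line is that a constructor which reads `node.value` directly returns every item as a string, including numbers. That is fine for a list of feature names, but worth keeping in mind for other uses.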

I hope this is useful as a reference.
